Patent Abstract:
The disclosed technology relates to constructing a convolutional neural network-based classifier for variant classification. In particular, it relates to training a convolutional neural network-based classifier on training data using a backpropagation-based gradient update technique that progressively matches outputs of the convolutional neural network-based classifier with corresponding ground truth labels. The convolutional neural network-based classifier comprises groups of residual blocks; each group of residual blocks is parameterized by a number of convolution filters in the residual blocks, a convolution window size of the residual blocks, and an atrous convolution rate of the residual blocks; the convolution window size varies between groups of residual blocks; the atrous convolution rate varies between groups of residual blocks. The training data includes benign training examples and pathogenic training examples of translated sequence pairs generated from benign variants and pathogenic variants.
Publication number: BR112019027609A2
Application number: R112019027609-2
Filing date: 2018-10-15
Publication date: 2020-07-21
Inventors: Kishore JAGANATHAN; Kai-How FARH; Sofia Kyriazopoulou Panagiotopoulou; Jeremy Francis McRAE
Applicant: Illumina, Inc.
IPC main classification:
Patent Description:

[0001] The Appendix includes a bibliography of potentially relevant references listed in a paper authored by the inventors. The subject matter of the paper is addressed in the US Provisionals to which this application claims priority to/benefit of. These references can be made available by the Counsel upon request or may be accessible via the Global Dossier. PRIORITY APPLICATIONS
[0002] This application claims priority to or the benefit of US Provisional Patent Application No. 62/573,125, entitled "Deep Learning-Based Splice Site Classification", by Kishore Jaganathan, Kai-How Farh, Sofia Kyriazopoulou Panagiotopoulou and Jeremy Francis McRae, filed on October 16, 2017 (Attorney Docket No. ILLM 1001-1/IP-1610-PRV); US Provisional Patent Application No. 62/573,131, entitled "Deep Learning-Based Aberrant Splicing Detection", by Kishore Jaganathan, Kai-How Farh, Sofia Kyriazopoulou Panagiotopoulou and Jeremy Francis McRae, filed on October 16, 2017 (Attorney Docket No. ILLM 1001-2/IP-1614-PRV); US Provisional Patent Application No. 62/573,135, entitled "Aberrant Splicing Detection Using Convolutional Neural Networks (CNNs)", by Kishore Jaganathan, Kai-How Farh, Sofia Kyriazopoulou Panagiotopoulou and Jeremy Francis McRae, filed on October 16, 2017 (Attorney Docket No. ILLM 1001-3/IP-1615-PRV); and US Provisional Patent Application No. 62/726,158, entitled "Predicting Splicing from Primary Sequence with Deep Learning", by Kishore Jaganathan, Kai-How Farh, Sofia Kyriazopoulou Panagiotopoulou and Jeremy Francis McRae, filed on August 31, 2018 (Attorney Docket No. ILLM
[0003] The following are incorporated by reference for all purposes as if fully set forth herein:
[0004] PCT Patent Application No. PCT/US18/, entitled "Deep Learning-Based Aberrant Splicing Detection", by Kishore Jaganathan, Kai-How Farh, Sofia Kyriazopoulou Panagiotopoulou and Jeremy Francis McRae, filed on October 15, 2018 (Attorney Docket No. ILLM 1001-8/IP-1614-PCT), subsequently published as PCT Publication No. WO.
[0005] PCT Patent Application No. PCT/US18/, entitled "Aberrant Splicing Detection Using Convolutional Neural Networks (CNNs)", by Kishore Jaganathan, Kai-How Farh, Sofia Kyriazopoulou Panagiotopoulou and Jeremy Francis McRae, filed on October 15, 2018 (Attorney Docket No. ILLM 1001-9/IP-1615-PCT), subsequently published as PCT Publication No. WO.
[0006] US Non-Provisional Patent Application entitled "Deep Learning-Based Splice Site Classification", by Kishore Jaganathan, Kai-How Farh, Sofia Kyriazopoulou Panagiotopoulou and Jeremy Francis McRae (Attorney Docket No. ILLM 1001-4/IP-1610-US), filed contemporaneously.
[0007] US Non-Provisional Patent Application entitled "Deep Learning-Based Aberrant Splicing Detection", by Kishore Jaganathan, Kai-How Farh, Sofia Kyriazopoulou Panagiotopoulou and Jeremy Francis McRae (Attorney Docket No. ILLM 1001-5/IP-1614-US), filed contemporaneously.
[0008] US Non-Provisional Patent Application entitled "Aberrant Splicing Detection Using Convolutional Neural Networks (CNNs)" by Kishore
[0009] Document 1 - A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior and K. Kavukcuoglu, "WAVENET: A GENERATIVE MODEL FOR RAW AUDIO", arXiv:1609.03499, 2016;
[0010] Document 2 - S. Ö. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta and M. Shoeybi, "DEEP VOICE: REAL-TIME NEURAL TEXT-TO-SPEECH", arXiv:1702.07825, 2017;
[0011] Document 3 - F. Yu and V. Koltun, "MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS", arXiv:1511.07122, 2016;
[0012] Document 4 - K. He, X. Zhang, S. Ren and J. Sun, "DEEP RESIDUAL LEARNING FOR IMAGE RECOGNITION", arXiv:1512.03385, 2015;
[0013] Document 5 - R. K. Srivastava, K. Greff and J. Schmidhuber, "HIGHWAY NETWORKS", arXiv:1505.00387, 2015;
[0014] Document 6 - G. Huang, Z. Liu, L. van der Maaten and K. Q. Weinberger, "DENSELY CONNECTED CONVOLUTIONAL NETWORKS", arXiv:1608.06993, 2017;
[0015] Document 7 - C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, "GOING DEEPER WITH CONVOLUTIONS", arXiv:1409.4842, 2014;
[0016] Document 8 - S. Ioffe and C. Szegedy, "BATCH NORMALIZATION: ACCELERATING DEEP NETWORK TRAINING BY REDUCING INTERNAL COVARIATE SHIFT", arXiv:1502.03167, 2015;
[0017] Document 9 - J. M. Wolterink, T. Leiner, M. A. Viergever and I. Išgum, "DILATED CONVOLUTIONAL NEURAL NETWORKS FOR CARDIOVASCULAR MR SEGMENTATION IN CONGENITAL HEART DISEASE", arXiv:1704.03669, 2017;
[0018] Document 10 - L. C. Piqueras, "AUTOREGRESSIVE MODEL BASED ON A DEEP CONVOLUTIONAL NEURAL NETWORK FOR AUDIO GENERATION", Tampere University of Technology, 2016;
[0019] Document 11 - J. Wu, "Introduction to Convolutional Neural Networks", Nanjing University, 2017;
[0020] Document 12 - I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville and Y. Bengio, "CONVOLUTIONAL NETWORKS", Deep Learning, MIT Press, 2016; and
[0021] Document 13 - J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang and G. Wang, "RECENT ADVANCES IN CONVOLUTIONAL NEURAL NETWORKS", arXiv:1512.07108, 2017.
[0022] Document 1 describes deep convolutional neural network architectures that use groups of residual blocks with convolution filters having the same convolution window size, batch normalization layers, rectified linear unit (abbreviated ReLU) layers, dimensionality-altering layers, atrous convolution layers with exponentially growing atrous convolution rates, skip connections and a softmax classification layer to accept an input sequence and produce an output sequence that scores entries in the input sequence. The disclosed technology uses neural network components and parameters described in Document 1. In one implementation, the disclosed technology modifies the parameters of the neural network components described in Document 1. For instance, unlike in Document 1, the atrous convolution rate in the disclosed technology progresses non-exponentially from a lower group of residual blocks to a higher group of residual blocks. In another example, unlike in Document 1, the convolution window size in the disclosed technology varies between groups of residual blocks.
[0023] Document 2 describes details of the deep convolutional neural network architectures described in Document 1.
[0024] Document 3 describes the atrous convolutions used by the disclosed technology. As used herein, atrous convolutions are also referred to as "dilated convolutions". Atrous/dilated convolutions allow for large receptive fields with few trainable parameters. An atrous/dilated convolution is a convolution in which the kernel is applied over an area larger than its length by skipping input values with a certain step, also called the atrous convolution rate or dilation factor. Atrous/dilated convolutions add spacing between the elements of a convolution filter/kernel, so that neighboring input entries (e.g., nucleotides, amino acids) at larger intervals are considered when a convolution operation is performed. This enables incorporation of long-range contextual dependencies in the input. Atrous convolutions conserve partial convolution calculations for reuse as adjacent nucleotides are processed.
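For illustration only, the following is a minimal NumPy sketch of an atrous/dilated 1D convolution (the input values, kernel and rates below are invented for this example and are not taken from the disclosure); neighboring kernel taps are applied to inputs spaced `rate` positions apart:

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """Dilated (atrous) 1D convolution: kernel taps are applied to inputs
    spaced `rate` positions apart, enlarging the receptive field without
    adding trainable parameters."""
    k = len(kernel)
    span = (k - 1) * rate + 1          # receptive field of one output value
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        # neighboring taps skip (rate - 1) input values between them
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out

x = np.arange(16, dtype=float)
print(atrous_conv1d(x, np.array([1.0, 0.0, -1.0]), rate=1))  # ordinary convolution
print(atrous_conv1d(x, np.array([1.0, 0.0, -1.0]), rate=4))  # dilated convolution
```

With `rate=4` the same three-tap kernel covers a receptive field of nine input positions instead of three.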
[0025] Document 4 describes residual blocks and residual connections used by the disclosed technology.
[0026] Document 5 describes the skip connections used by the disclosed technology. As used herein, skip connections are also referred to as "highway networks".
[0027] Document 6 describes densely connected convolutional network architectures used by the disclosed technology.
[0028] Document 7 describes dimensionality-altering convolution layers and module-based processing pipelines used by the disclosed technology. One example of a dimensionality-altering convolution is a 1 x 1 convolution.
[0029] Document 8 describes the batch normalization layers used by the disclosed technology.
[0030] Document 9 also describes atrous/dilated convolutions used by the disclosed technology.
[0031] Document 10 describes various deep neural network architectures that can be used by the disclosed technology, including convolutional neural networks, deep convolutional neural networks and deep convolutional neural networks with atrous/dilated convolutions.
[0032] Document 11 describes details of a convolutional neural network that can be used by the disclosed technology, including algorithms for training a convolutional neural network with subsampling layers (e.g., pooling) and fully connected layers.
[0033] Document 12 describes details of various convolution operations that can be used by the disclosed technology.
[0034] Document 13 describes various convolutional neural network architectures that can be used by the disclosed technology. INCORPORATION BY REFERENCE OF TABLES SUBMITTED ELECTRONICALLY WITH THE APPLICATION
[0035] The following table files in ASCII text format are submitted with this application and incorporated by reference. The names, creation dates and sizes of the files are:
[0036] table S4 mutation rates.txt, August 31, 2018, 2,452 KB
[0037] table S5 gene enrichment.txt, August 31, 2018, 362 KB
[0038] table S6 validation.txt, August 31, 2018, 362 KB FIELD OF THE TECHNOLOGY DISCLOSED
[0039] The disclosed technology relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge-based systems, reasoning systems and knowledge acquisition systems); including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems and artificial neural networks. In particular, the disclosed technology relates to using deep learning-based techniques for training deep convolutional neural networks. BACKGROUND
[0040] The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
[0041] In machine learning, input variables are used to predict an output variable. The input variables are often called features and are denoted by $X = (X_1, X_2, \ldots, X_k)$, where each $X_i$, $i \in 1, \ldots, k$, is a feature. The output variable is often called the response or dependent variable and is denoted by the variable $Y_i$. The relationship between $Y$ and the corresponding $X$ can be written in the general form:

$Y = f(X) + \epsilon$

[0042] In the equation above, $f$ is a function of the features $(X_1, X_2, \ldots, X_k)$ and $\epsilon$ is the random error term. The error term is independent of $X$ and has a mean value of zero.

[0043] In practice, the features $X$ are available without having $Y$ or knowing the exact relation between $X$ and $Y$. Since the error term has a mean value of zero, the goal is to estimate $f$:

$\hat{Y} = \hat{f}(X)$

[0044] In the equation above, $\hat{f}$ is the estimate of $f$, which is often considered a black box, meaning that only the relation between the inputs and outputs of $\hat{f}$ is known, but the question of why it works remains unanswered.

[0045] The function $\hat{f}$ is found using learning. Supervised learning and unsupervised learning are two ways used in machine learning for this task. In supervised learning, labeled data is used for training. By showing the inputs and the corresponding outputs (= labels), the function $\hat{f}$ is optimized such that it approximates the outputs. In unsupervised learning, the goal is to find a hidden structure in unlabeled data. The algorithm has no measure of accuracy on the input data, which distinguishes it from supervised learning.
[0046] The single layer perceptron (SLP) is the simplest model of a neural network. It comprises one input layer and one activation function, as shown in FIGURE 1. The inputs are passed through the weighted graph. The function $f$ uses the sum of the inputs as its argument and compares it with a threshold $\theta$.
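A minimal sketch of the SLP just described, assuming NumPy and illustrative weights and threshold (none of these values come from the disclosure):

```python
import numpy as np

def slp(x, w, theta=0.0):
    """Single layer perceptron: compute the weighted sum of the inputs
    and compare it against the threshold theta."""
    return 1.0 if np.dot(w, x) > theta else 0.0

# illustrative input and weights (hypothetical values)
print(slp(np.array([1.0, 0.0, 1.0]), np.array([0.5, -0.2, 0.3])))  # -> 1.0
```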
[0047] FIGURE 2 depicts one implementation of a fully connected multi-layer neural network. A neural network is a system of interconnected artificial neurons (e.g., $a_1$, $a_2$, $a_3$) that exchange messages with each other. The illustrated neural network has three inputs, two neurons in the hidden layer and two neurons in the output layer. The hidden layer has an activation function $f(\cdot)$ and the output layer has an activation function $\varphi(\cdot)$. The connections have numeric weights (e.g., $w_{11}$, $w_{21}$, $w_{12}$, $w_{31}$, $w_{22}$, $w_{32}$, $v_{11}$, $v_{22}$) that are tuned during the training process, so that a properly trained network responds correctly when fed an image to recognize. The input layer processes the raw input, the hidden layer processes the output of the input layer based on the weights of the connections between the input layer and the hidden layer. The output layer takes the output of the hidden layer and processes it based on the weights of the connections between the hidden layer and the output layer. The network includes multiple layers of feature-detecting neurons. Each layer has many neurons that respond to different combinations of inputs from the previous layers. The layers are constructed so that the first layer detects a set of primitive patterns in the input image data, the second layer detects patterns of patterns and the third layer detects patterns of those patterns.
[0048] A survey of applications of deep learning in genomics can be found in the following publications:
- T. Ching et al., Opportunities And Obstacles For Deep Learning In Biology And Medicine, www.biorxiv.org:142760, 2017;
- Angermueller C, Pärnamaa T, Parts L, Stegle O. Deep Learning For Computational Biology. Mol Syst Biol. 2016;12:878;
- Park Y, Kellis M. 2015 Deep Learning For Regulatory Genomics. Nat. Biotechnol. 33, 825-826. (doi:10.1038/nbt.3313);
- Min, S., Lee, B. & Yoon, S. Deep Learning In Bioinformatics. Brief. Bioinform. bbw068 (2016);
- Leung MK, Delong A, Alipanahi B et al. Machine Learning In Genomic Medicine: A Review of Computational Problems and Data Sets 2016; and
- Libbrecht MW, Noble WS. Machine Learning Applications In Genetics and Genomics. Nature Reviews Genetics 2015;16(6):321-32. BRIEF DESCRIPTION OF THE FIGURES
[0049] In the figures, like reference characters generally refer to like parts throughout the different views. Also, the figures are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the disclosed technology. In the following description, various implementations of the disclosed technology are described with reference to the following figures, in which:
[0050] FIGURE 1 depicts a single layer perceptron (SLP).
[0051] FIGURE 2 depicts one implementation of a feed-forward neural network with multiple layers.
[0052] FIGURE 3 depicts one implementation of the workings of a convolutional neural network.
[0053] FIGURE 4 depicts a block diagram of training a convolutional neural network in accordance with one implementation of the disclosed technology.
[0054] FIGURE 5 depicts one implementation of a ReLU non-linear layer in accordance with one implementation of the disclosed technology.
[0055] FIGURE 6 illustrates dilated convolutions.
[0056] FIGURE 7 is one implementation of subsampling layers (average/max pooling) in accordance with one implementation of the disclosed technology.
[0057] FIGURE 8 depicts one implementation of a two-layer convolution of the convolution layers.
[0058] FIGURE 9 depicts a residual connection that reinjects prior information downstream via feature map addition.
[0059] FIGURE 10 depicts one implementation of residual blocks and skip connections.
[0060] FIGURE 11 depicts one implementation of stacked dilated convolutions.
[0061] FIGURE 12 shows the batch normalization forward pass.
[0062] FIGURE 13 illustrates the batch normalization transform at test time.
[0063] FIGURE 14 shows the batch normalization backward pass.
[0064] FIGURE 15 depicts use of a batch normalization layer with a convolutional or densely connected layer.
[0065] FIGURE 16 depicts one implementation of 1D convolution.
[0066] FIGURE 17 illustrates how global average pooling (GAP) works.
[0067] FIGURE 18 illustrates one implementation of a computing environment with training servers and production servers that can be used to implement the disclosed technology.
[0068] FIGURE 19 depicts one implementation of an architecture of an atrous convolutional neural network (abbreviated ACNN), referred to herein as "SpliceNet".
[0069] FIGURE 20 depicts one implementation of a residual block that can be used by the ACNN and a convolutional neural network (abbreviated CNN).
[0070] FIGURE 21 depicts another implementation of the architecture of the ACNN, referred to herein as "SpliceNet80".
[0071] FIGURE 22 depicts yet another implementation of the architecture of the ACNN, referred to herein as "SpliceNet400".
[0072] FIGURE 23 depicts yet another implementation of the architecture of the ACNN, referred to herein as "SpliceNet2000".
[0073] FIGURE 24 depicts yet another implementation of the architecture of the ACNN, referred to herein as "SpliceNet10000".
[0074] FIGURES 25, 26 and 27 depict various types of inputs processed by the ACNN and the CNN.
[0075] FIGURE 28 shows that the ACNN can be trained on at least 800 million non-splicing sites and the CNN can be trained on at least 1 million non-splicing sites.
[0076] FIGURE 29 illustrates a one-hot encoder.
[0077] FIGURE 30 depicts training of the ACNN.
[0078] FIGURE 31 depicts a CNN.
[0079] FIGURE 32 depicts training, validation and testing of the ACNN and the CNN.
[0080] FIGURE 33 depicts a reference sequence and an alternative sequence.
[0081] FIGURE 34 illustrates aberrant splicing detection.
[0082] FIGURE 35 illustrates the processing pyramid of SpliceNet10000 for splice site classification.
[0083] FIGURE 36 depicts the processing pyramid of SpliceNet10000 for aberrant splicing detection.
[0084] FIGURES 37A, 37B, 37C, 37D, 37E, 37F, 37G and 37H illustrate one implementation of splicing prediction from primary sequence with deep learning.
[0085] FIGURES 38A, 38B, 38C, 38D, 38E, 38F and 38G depict one implementation of validation of rare cryptic splice mutations in RNA-seq data.
[0086] FIGURES 39A, 39B and 39C depict one implementation of cryptic splice variants frequently creating tissue-specific alternative splicing.
[0087] FIGURES 40A, 40B, 40C, 40D and 40E depict one implementation of predicted cryptic splice variants being strongly deleterious in human populations.
[0088] FIGURES 41A, 41B, 41C, 41D, 41E and 41F depict one implementation of de novo cryptic splice mutations in patients with rare genetic disease.
[0089] FIGURES 42A and 42B depict one implementation of evaluation of various splicing prediction algorithms on lincRNAs.
[0090] FIGURES 43A and 43B illustrate position-dependent effects of the TACTAAC branch point and GAAGAA exonic splice enhancer motifs.
[0091] FIGURES 44A and 44B depict effects of nucleosome positioning on splicing.
[0092] FIGURE 45 illustrates an example of effect size calculation for a splice-disrupting variant with complex effects.
[0093] FIGURES 46A, 46B and 46C show an evaluation of the SpliceNet-10k model on singleton and common variants.
[0094] FIGURES 47A and 47B depict the validation rate and effect sizes of splice-site-creating variants, split by the location of the variant.
[0095] FIGURES 48A, 48B, 48C and 48D depict evaluation of the SpliceNet-10k model on training and test chromosomes.
[0096] FIGURES 49A, 49B and 49C illustrate de novo cryptic splice mutations in patients with rare genetic disease, from sites in synonymous, intronic or untranslated regions only.
[0097] FIGURES 50A and 50B depict de novo cryptic splice mutations in ASD and as a proportion of pathogenic DNMs.
[0098] FIGURES 51A, 51B, 51C, 51D, 51E, 51F, 51G, 51H, 51I and 51J depict RNA-seq validation of predicted cryptic splice mutations in ASD patients.
[0099] FIGURES 52A and 52B illustrate the validation rate and sensitivity on RNA-seq of a model trained only on canonical transcripts.
[00100] FIGURES 53A, 53B and 53C illustrate that ensemble modeling improves the performance of SpliceNet-10k.
[00101] FIGURES 54A and 54B depict evaluation of SpliceNet-10k in regions of variable exon density.
[00102] FIGURE 55 is Table S1, which depicts one implementation of GTEx samples used for demonstrating effect size calculations and tissue-specific splicing.
[00103] FIGURE 56 is Table S2, which depicts one implementation of cutoffs used to evaluate the validation rate and sensitivity of different algorithms.
[00104] FIGURE 57 depicts one implementation of gene enrichment analysis.
[00105] FIGURE 58 depicts one implementation of genome-wide enrichment analysis.
[00106] FIGURE 59 is a simplified block diagram of a computer system that can be used to implement the disclosed technology. DETAILED DESCRIPTION
[00107] The following discussion is presented to enable any person skilled in the art to make and use the disclosed technology, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed technology is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
[00108] A convolutional neural network is a special type of neural network. The fundamental difference between a densely connected layer and a convolution layer is this: dense layers learn global patterns in their input feature space, whereas convolution layers learn local patterns: in the case of images, patterns found in small 2D windows of the inputs. This key characteristic gives convolutional neural networks two interesting properties: (1) the patterns they learn are translation-invariant and (2) they can learn spatial hierarchies of patterns.
[00109] Regarding the first, after learning a certain pattern in the lower-right corner of a picture, a convolution layer can recognize it anywhere: for example, in the upper-left corner. A densely connected network would have to learn the pattern anew if it appeared at a new location. This makes convolutional neural networks data efficient because they need fewer training samples to learn representations, since those representations have generalization power.
[00110] Regarding the second, a first convolution layer can learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layers, and so on. This allows convolutional neural networks to efficiently learn increasingly complex and abstract visual concepts.
[00111] A convolutional neural network learns highly non-linear mappings by interconnecting layers of artificial neurons arranged in many different layers with activation functions that make the layers dependent. It includes one or more convolutional layers, interspersed with one or more subsampling layers and non-linear layers, which are typically followed by one or more fully connected layers. Each element of the convolutional neural network receives inputs from a set of features in the previous layer. The convolutional neural network learns concurrently because the neurons in the same feature map have identical weights. These locally shared weights reduce the complexity of the network, such that when multi-dimensional input data enters the network, the convolutional neural network avoids the complexity of data reconstruction in the feature extraction and regression or classification process.
[00112] Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green and blue. For a black-and-white picture, the depth is 1 (levels of gray). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. This output feature map is still a 3D tensor: it has a width and a height. Its depth can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors as in RGB input; rather, they stand for filters. Filters encode specific aspects of the input data: at a high level, a single filter could encode the concept "presence of a face in the input", for instance.
[00113] For example, the first convolution layer takes a feature map of size (28, 28, 1) and outputs a feature map of size (26, 26, 32): it computes 32 filters over its input. Each of these 32 output channels contains a 26 x 26 grid of values, which is a response map of the filter over the input, indicating the response of that filter pattern at different locations in the input. That is what the term feature map means: every dimension in the depth axis is a feature (or filter), and the 2D tensor output[:, :, n] is the 2D spatial map of the response of this filter over the input.
[00114] Convolutions are defined by two key parameters: (1) the size of the patches extracted from the inputs, typically 1 x 1, 3 x 3 or 5 x 5, and (2) the depth of the output feature map, i.e., the number of filters computed by the convolution. These often start with a depth of 32, continue to a depth of 64, and end with a depth of 128 or 256.
[00115] A convolution works by sliding these windows of size 3 x 3 or 5 x 5 over the 3D input feature map, stopping at every location and extracting the 3D patch of surrounding features (of shape (window height, window width, input depth)). Each such 3D patch is then transformed (via a tensor product with the same learned weight matrix, called the convolution kernel) into a 1D vector of shape (output depth,). All of these vectors are then spatially reassembled into a 3D output map of shape (height, width, output depth). Every spatial location in the output feature map corresponds to the same location in the input feature map (for example, the lower-right corner of the output contains information about the lower-right corner of the input). For instance, with 3 x 3 windows, the vector output[i, j, :] comes from the 3D patch input[i-1:i+1, j-1:j+1, :]. The full process is detailed in FIGURE 3.
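The patch-extraction-and-tensor-product view of convolution described above can be sketched naively in NumPy as follows (the (28, 28, 1) to (26, 26, 32) shapes reproduce the example in the text; this is an illustration, not an optimized implementation):

```python
import numpy as np

def conv2d(inputs, kernel):
    """Slide a (wh, ww, in_depth) window over the input feature map and
    transform each 3D patch into a vector of length out_depth via a
    tensor product with the learned convolution kernel."""
    h, w, _ = inputs.shape
    wh, ww, _, out_depth = kernel.shape
    out = np.zeros((h - wh + 1, w - ww + 1, out_depth))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = inputs[i:i + wh, j:j + ww, :]        # 3D patch of features
            out[i, j, :] = np.tensordot(patch, kernel, axes=3)
    return out

x = np.random.rand(28, 28, 1)     # a (28, 28, 1) input feature map
k = np.random.rand(3, 3, 1, 32)   # 3 x 3 windows, 32 filters
print(conv2d(x, k).shape)         # (26, 26, 32)
```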
[00116] The convolutional neural network comprises convolution layers that perform the convolution operation between the input values and convolution filters (matrices of weights) that are learned over many gradient update iterations during training. Let $(m, n)$ be the filter size and $W$ be the matrix of weights; then a convolution layer performs a convolution of $W$ with the input $X$ by calculating the dot product $W \cdot x + b$, where $x$ is an instance of $X$ and $b$ is the bias. The step size by which the convolution filters slide across the input is called the stride, and the filter area $(m \times n)$ is called the receptive field. The same convolution filter is applied across different positions of the input, which reduces the number of weights learned. It also allows location-invariant learning, i.e., if an important pattern exists in the input, the convolution filters learn it no matter where it is in the sequence.
[00117] FIGURE 4 depicts a block diagram of training a convolutional neural network in accordance with one implementation of the disclosed technology. The convolutional neural network is adjusted or trained so that the input data leads to a specific output estimate. The convolutional neural network is adjusted using backpropagation based on a comparison of the output estimate and the ground truth, until the output estimate progressively matches or approaches the ground truth.
[00118] The convolutional neural network is trained by adjusting the weights between the neurons based on the difference between the ground truth and the actual output. This is mathematically described as:

$\Delta w_i = x_i \delta$, where $\delta = (\text{ground truth}) - (\text{actual output})$

[00119] In one implementation, the training rule is defined as:

$w_{nm} \leftarrow w_{nm} + \alpha (t_m - \varphi_m) a_n$

[00120] In the equation above: the arrow indicates an update of the value; $t_m$ is the target value of neuron $m$; $\varphi_m$ is the computed current output of neuron $m$; $a_n$ is input $n$; and $\alpha$ is the learning rate.
[00121] The intermediate step in training includes generating a feature vector from the input data using the convolution layers. The gradient with respect to the weights in each layer, starting at the output, is calculated. This is referred to as the backward pass. The weights in the network are updated using a combination of the negative gradient and the previous weights.
[00122] In one implementation, the convolutional neural network uses a stochastic gradient update algorithm (such as ADAM) that performs backward propagation of errors by means of gradient descent. One example of a sigmoid-function-based backpropagation algorithm is described below:

$\varphi(h) = \dfrac{1}{1 + e^{-h}}$

[00123] In the sigmoid function above, $h$ is the weighted sum computed by a neuron. The sigmoid function has the following derivative:

$\dfrac{\partial \varphi}{\partial h} = \varphi (1 - \varphi)$
[00124] The algorithm includes computing the activation of all neurons in the network, yielding an output for the forward pass. The activation of neuron $m$ in the hidden layers is described as:

$\varphi_m = \varphi\left(\sum_n w_{nm} a_n\right)$

[00125] This is done for all the hidden layers to obtain the activation at the output layer, described as:

$\varphi_k = \varphi\left(\sum_m v_{mk} \varphi_m\right)$
[00126] Then, the error and the correct weights are calculated per layer. The error at the output is computed as:

$\delta_{ok} = (t_k - \varphi_k)\, \varphi_k (1 - \varphi_k)$

[00127] The error in the hidden layers is calculated as:

$\delta_{hm} = \varphi_m (1 - \varphi_m) \sum_k v_{mk}\, \delta_{ok}$

[00128] The weights of the output layer are updated as:

$v_{mk} \leftarrow v_{mk} + \alpha\, \delta_{ok}\, \varphi_m$

[00129] The weights of the hidden layers are updated using the learning rate $\alpha$ as:

$w_{nm} \leftarrow w_{nm} + \alpha\, \delta_{hm}\, a_n$
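Putting the four equations above together, the following is a minimal NumPy sketch of one training step of a one-hidden-layer sigmoid network (the layer sizes and input values are invented for illustration and do not come from the disclosure):

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_step(x, t, W, V, alpha=0.1):
    """One backpropagation step: W holds the input-to-hidden weights w_nm,
    V holds the hidden-to-output weights v_mk."""
    # forward pass
    a = sigmoid(x @ W)                  # hidden activations phi_m
    o = sigmoid(a @ V)                  # output activations phi_k
    # output error: delta_ok = (t_k - phi_k) phi_k (1 - phi_k)
    delta_o = (t - o) * o * (1 - o)
    # hidden error: delta_hm = phi_m (1 - phi_m) sum_k v_mk delta_ok
    delta_h = a * (1 - a) * (V @ delta_o)
    # weight updates
    V += alpha * np.outer(a, delta_o)   # v_mk <- v_mk + alpha delta_ok phi_m
    W += alpha * np.outer(x, delta_h)   # w_nm <- w_nm + alpha delta_hm a_n
    return o

rng = np.random.default_rng(0)
W, V = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
train_step(np.array([1.0, 0.5, -0.2]), np.array([1.0, 0.0]), W, V)
```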
[00130] In one implementation, the convolutional neural network uses gradient descent optimization to compute the error across all the layers. In such an optimization, for an input feature vector $x$ and the predicted output $\hat{y}$, the loss function is defined as $\ell$ for the cost of predicting $\hat{y}$ when the target is $y$, i.e., $\ell(\hat{y}, y)$. The predicted output $\hat{y}$ is transformed from the input feature vector $x$ using the function $f$. The function $f$ is parameterized by the weights of the convolutional neural network, i.e., $\hat{y} = f_w(x)$. The loss function is described as $\ell(\hat{y}, y) = \ell(f_w(x), y)$, or $Q(z, w) = \ell(f_w(x), y)$, where $z$ is an input and output data pair $(x, y)$. Gradient descent optimization is performed by updating the weights according to:

$\nu_{t+1} = \mu \nu_t - \alpha \dfrac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t)$

$w_{t+1} = w_t + \nu_{t+1}$

[00131] In the equations above, $\alpha$ is the learning rate. Also, the loss is computed as the average over a set of $n$ data pairs. The computation is terminated when the learning rate $\alpha$ is small enough upon linear convergence. In other implementations, the gradient is calculated using only selected data pairs fed into a Nesterov accelerated gradient and an adaptive gradient to inject computational efficiency.

[00132] In one implementation, the convolutional neural network uses stochastic gradient descent (SGD) to calculate the cost function. SGD approximates the gradient with respect to the weights in the loss function by computing it from only one, randomized, data pair $z_t$, described as:

$\nu_{t+1} = \mu \nu_t - \alpha \nabla_w Q(z_t, w_t)$

$w_{t+1} = w_t + \nu_{t+1}$

[00133] In the equations above: $\alpha$ is the learning rate; $\mu$ is the momentum; and $w_t$ is the current weight state before updating. The convergence speed of SGD is approximately $O(1/t)$ when the learning rate $\alpha$ is reduced both fast and slow enough. In other implementations, the convolutional neural network uses different loss functions, such as Euclidean loss and softmax loss. In a further implementation, the Adam stochastic optimizer is used by the convolutional neural network.
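A minimal sketch of the SGD-with-momentum update in the equations above, assuming NumPy and an illustrative quadratic loss (the loss and hyperparameter values are assumptions for this example, not values from the disclosure):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, alpha=0.01, mu=0.9):
    """One SGD step with momentum:
    v_{t+1} = mu * v_t - alpha * grad_w Q(z_t, w_t)
    w_{t+1} = w_t + v_{t+1}
    where grad is the gradient computed on a single random data pair."""
    v = mu * v - alpha * grad
    w = w + v
    return w, v

# illustrative quadratic loss Q(w) = ||w||^2 / 2, whose gradient is w itself
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad=w)
print(w)   # moves toward the minimum at the origin
```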
[00134] The convolution layers of the convolutional neural network serve as feature extractors. Convolution layers act as adaptive feature extractors capable of learning and decomposing the input data into hierarchical features. In one implementation, the convolution layers take two images as input and produce a third image as output. In such an implementation, convolution operates on two images in two dimensions (2D), with one image being the input image and the other image, called the "kernel", applied as a filter on the input image, producing an output image. Thus, for an input vector $f$ of length $n$ and a kernel $g$ of length $m$, the convolution $f * g$ of $f$ and $g$ is defined as:

$(f * g)(i) = \sum_{j=1}^{m} g(j) \cdot f(i - j + m/2)$

[00135] The convolution operation includes sliding the kernel over the input image. For each position of the kernel, the overlapping values of the kernel and the input image are multiplied and the results are added. The sum of the products is the value of the output image at the point in the input image where the kernel is centered. The resulting different outputs from many kernels are called feature maps.
[00136] Once the convolutional layers are trained, they are applied to perform recognition tasks on new inference data. Since the convolutional layers learn from the training data, they avoid explicit feature extraction and learn implicitly from the training data. Convolution layers use convolution filter kernel weights, which are determined and updated as part of the training process. The convolution layers extract different features of the input, which are combined at higher layers. The convolutional neural network uses a varied number of convolution layers, each with different convolving parameters such as kernel size, strides, padding, number of feature maps and weights. Non-Linear Layers
[00137] FIGURE 5 depicts one implementation of non-linear layers in accordance with one implementation of the disclosed technology. Non-linear layers use different non-linear trigger functions to signal distinct identification of likely features on each hidden layer. Non-linear layers use a variety of specific functions to implement the non-linear triggering, including the rectified linear units (ReLUs), hyperbolic tangent, absolute of hyperbolic tangent, sigmoid and continuous trigger (non-linear) functions. In one implementation, a ReLU activation implements the function $y = \max(x, 0)$ and keeps the input and output sizes of a layer the same. The advantage of using ReLU is that the convolutional neural network is trained many times faster. ReLU is a non-continuous, non-saturating activation function that is linear with respect to the input if the input values are larger than zero and zero otherwise. Mathematically, a ReLU activation function is described as:

$\varphi(h) = \max(h, 0) = \begin{cases} h & \text{if } h > 0 \\ 0 & \text{if } h \leq 0 \end{cases}$
[00138] In other implementations, the convolutional neural network uses a power unit activation function, which is a continuous, non-saturating function described by:

$\varphi(h) = (a + bh)^c$

[00139] In the equation above, $a$, $b$ and $c$ are parameters controlling the shift, scale and power, respectively. The power activation function is able to yield $x$- and $y$-antisymmetric activation if $c$ is odd, and $y$-symmetric activation if $c$ is even. In some implementations, the unit yields a non-rectified linear activation.

[00140] In yet other implementations, the convolutional neural network uses a sigmoid unit activation function, which is a continuous, saturating function described by the following logistic function:

$\varphi(h) = \dfrac{1}{1 + e^{-\beta h}}$

[00141] In the equation above, $\beta = 1$. The sigmoid unit activation function does not yield negative activation and is only antisymmetric with respect to the $y$-axis.
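The three activation functions described above can be sketched in NumPy as follows (the parameter defaults are illustrative assumptions, not values from the disclosure):

```python
import numpy as np

def relu(h):
    """Rectified linear unit: phi(h) = max(h, 0)."""
    return np.maximum(h, 0.0)

def power_unit(h, a=0.0, b=1.0, c=2.0):
    """Power unit activation: phi(h) = (a + b*h)**c; y-symmetric for even c,
    antisymmetric for odd c (a, b, c here are illustrative values)."""
    return (a + b * h) ** c

def sigmoid_unit(h, beta=1.0):
    """Sigmoid unit activation: phi(h) = 1 / (1 + exp(-beta * h))."""
    return 1.0 / (1.0 + np.exp(-beta * h))

h = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(h), power_unit(h), sigmoid_unit(h), sep="\n")
```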
[00142] FIGURE 6 illustrates dilated convolutions. Dilated convolutions are sometimes called atrous convolutions, which literally means convolutions with holes. The French name has its origins in the algorithme à trous, which computes the fast dyadic wavelet transform. In this type of convolutional layer, the inputs corresponding to the receptive field of the filters are not neighboring points. This is illustrated in FIGURE 6. The distance between the inputs depends on the dilation factor.
[00143] FIGURE 7 is one implementation of subsampling layers in accordance with one implementation of the disclosed technology. Subsampling layers reduce the resolution of the features extracted by the convolution layers to make the extracted features or feature maps robust against noise and distortion. In one implementation, subsampling layers employ two types of pooling operations, average pooling and max pooling. The pooling operations divide the input into non-overlapping two-dimensional spaces. For average pooling, the average of the four values in the region is calculated. For max pooling, the maximum value of the four values is selected.

[00144] In one implementation, the subsampling layers include pooling operations on a set of neurons in the previous layer, mapping its output to only one of the inputs in max pooling and mapping its output to the average of the inputs in average pooling. In max pooling, the output of the pooling neuron is the maximum value that resides within the input, as described by:

$\varphi_o = \max(\varphi_1, \varphi_2, \ldots, \varphi_N)$

[00145] In the equation above, $N$ is the total number of elements within a neuron set.

[00146] In average pooling, the output of the pooling neuron is the average value of the input values that reside within the input neuron set, as described by:

$\varphi_o = \dfrac{1}{N} \sum_{n=1}^{N} \varphi_n$

[00147] In the equation above, $N$ is the total number of elements within the input neuron set.

[00148] In FIGURE 7, the input is of size 4 x 4. For 2 x 2 subsampling, a 4 x 4 image is divided into four non-overlapping matrices of size 2 x 2. For average pooling, the average of the four values is the whole-integer output. For max pooling, the maximum value of the four values in the 2 x 2 matrix is the whole-integer output.
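A minimal NumPy sketch of the 2 x 2 average and max pooling on a 4 x 4 input described above (the input values are illustrative):

```python
import numpy as np

def pool2x2(x, mode="max"):
    """Divide the input into non-overlapping 2 x 2 regions and keep either
    the maximum or the average of the four values in each region."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(x, "max"))    # 2 x 2 map of per-region maxima
print(pool2x2(x, "mean"))   # 2 x 2 map of per-region averages
```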
[00149] FIGURE 8 depicts one implementation of a two-layer convolution of the convolution layers. In FIGURE 8, an input of size 2048 dimensions is convolved. At convolution 1, the input is convolved by a convolutional layer comprising two channels of sixteen kernels of size 3 x 3. The resulting sixteen feature maps are then rectified by means of the ReLU activation function at ReLU1 and then pooled in Pool 1 by means of average pooling using a sixteen-channel pooling layer with kernels of size 3 x 3. At convolution 2, the output of Pool 1 is then convolved by another convolutional layer comprising sixteen channels of thirty kernels with a size of 3 x 3. This is followed by yet another ReLU2 and average pooling in Pool 2 with a kernel size of 2 x 2. The convolution layers use varying numbers of strides and paddings, for example, zero, one, two and three. The resulting feature vector is five hundred and twelve (512) dimensions, according to one implementation.
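As a loose sketch only, a convolution-ReLU-pool pipeline of the kind shown in FIGURE 8 could be assembled in Keras roughly as follows (the input shape, channel counts and pool sizes below are simplified assumptions and do not reproduce the exact configuration of the figure):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 1)),                    # illustrative input size
    layers.Conv2D(16, kernel_size=3, padding="same"),  # Convolution 1
    layers.Activation("relu"),                         # ReLU1
    layers.AveragePooling2D(pool_size=3),              # Pool 1 (average, 3 x 3)
    layers.Conv2D(16, kernel_size=3, padding="same"),  # Convolution 2
    layers.Activation("relu"),                         # ReLU2
    layers.AveragePooling2D(pool_size=2),              # Pool 2 (average, 2 x 2)
    layers.Flatten(),
    layers.Dense(512),                                 # 512-dimensional feature vector
])
model.summary()
```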
[00150] In other implementations, the convolutional neural network uses different numbers of convolution layers, subsampling layers, non-linear layers and fully connected layers. In one implementation, the convolutional neural network is a shallow network with fewer layers and more neurons per layer, for example, one, two or three fully connected layers with one hundred (100) to two hundred (200) neurons per layer. In another implementation, the convolutional neural network is a deep network with more layers and fewer neurons per layer, for example, five (5), six (6) or eight (8) fully connected layers with thirty (30) to fifty (50) neurons per layer.
[00151] The output of a neuron of row $x$, column $y$ in the $l$-th convolution layer and $k$-th feature map, for $f$ number of convolution kernels in a feature map, is determined by the following equation:

$O_{x,y}^{(l,k)} = \tanh\left(\sum_{t=0}^{f-1} \sum_{r=0}^{k_h} \sum_{c=0}^{k_w} W_{(r,c)}^{(k,t)}\, O_{(x+r,\, y+c)}^{(l-1,t)} + \text{Bias}^{(l,k)}\right)$

[00152] The output of a neuron of row $x$, column $y$ in the $l$-th subsample layer and $k$-th feature map is determined by the following equation:

$O_{x,y}^{(l,k)} = \tanh\left(W^{(k)} \sum_{r=0}^{S_h} \sum_{c=0}^{S_w} O_{(x \times S_h + r,\, y \times S_w + c)}^{(l-1,k)} + \text{Bias}^{(l,k)}\right)$

[00153] The output of an $i$-th neuron of the output layer is determined by the following equation:

$O^{(l,i)} = \tanh\left(\sum_{j} O^{(l-1,j)}\, W_{(i,j)}^{l} + \text{Bias}^{(l,i)}\right)$

[00154] The output deviation of a $k$-th neuron in the output layer is determined by the following equation:

$d(O_k^o) = y_k - t_k$

[00155] The input deviation of a $k$-th neuron in the output layer is determined by the following equation:

$d(I_k^o) = (y_k - t_k)\, \varphi'(v_k) = \varphi'(v_k)\, d(O_k^o)$

[00156] The weight and bias variation of a $k$-th neuron in the output layer is determined by the following equation:

$\Delta W_{k,x}^{o} = d(I_k^o)\, y_{k,x}, \quad \Delta \text{Bias}_k^{o} = d(I_k^o)$

[00157] The output bias of a $k$-th neuron in the hidden layer is determined by the following equation:

$d(O_k^H) = \sum_{i=0}^{i<84} d(I_i^o)\, W_{i,k}$

[00158] The input bias of a $k$-th neuron in the hidden layer is determined by the following equation:

$d(I_k^H) = \varphi'(v_k)\, d(O_k^H)$

[00159] The weight and bias variation in row $x$, column $y$ in an $m$-th feature map of a prior layer receiving input from $k$ neurons in the hidden layer is determined by the following equation:

$\Delta W_{m,x,y}^{H,k} = d(I_k^H)\, y_{x,y}^{m}, \quad \Delta \text{Bias}_k^{H} = d(I_k^H)$

[00160] The output bias of row $x$, column $y$ in an $m$-th feature map of subsample layer $S$ is determined by the following equation:

$d(O_{x,y}^{S,m}) = \sum_{k} d(I_{m,x,y}^{H})\, W_{m,x,y}^{H,k}$

[00161] The input bias of row $x$, column $y$ in an $m$-th feature map of subsample layer $S$ is determined by the following equation:

$d(I_{x,y}^{S,m}) = \varphi'(v_{x,y})\, d(O_{x,y}^{S,m})$

[00162] The weight and bias variation in row $x$, column $y$ in an $m$-th feature map of subsample layer $S$ and convolution layer $C$ is determined by the following equation:

$\Delta W^{S,m} = \sum_{x} \sum_{y} d\big(I_{[x/2],[y/2]}^{S,m}\big)\, O_{x,y}^{C,m}, \quad \Delta \text{Bias}^{S,m} = \sum_{x} \sum_{y} d(O_{x,y}^{S,m})$

[00163] The output bias of row $x$, column $y$ in a $k$-th feature map of convolution layer $C$ is determined by the following equation:

$d(O_{x,y}^{C,k}) = d\big(I_{[x/2],[y/2]}^{S,k}\big)\, W^{S,k}$

[00164] The input bias of row $x$, column $y$ in a $k$-th feature map of convolution layer $C$ is determined by the following equation:

$d(I_{x,y}^{C,k}) = \varphi'(v_{x,y})\, d(O_{x,y}^{C,k})$

[00165] The weight and bias variation in row $r$, column $c$ in an $m$-th convolution kernel of a $k$-th feature map of the $l$-th convolution layer $C$ is determined by the following equation:

$\Delta W_{r,c}^{m,k} = \sum_{x} \sum_{y} d(I_{x,y}^{C,k})\, O_{x+r,\, y+c}^{l-1,m}, \quad \Delta \text{Bias}^{C,k} = \sum_{x} \sum_{y} d(I_{x,y}^{C,k})$
[00166] FIGURE 9 depicts a residual connection that reinjects prior information downstream via feature map addition. A residual connection comprises reinjecting previous representations into the downstream flow of data by adding a past output tensor to a later output tensor, which helps prevent information loss along the data-processing flow. Residual connections tackle two common problems that plague any large-scale deep learning model: vanishing gradients and representational bottlenecks. In general, adding residual connections to any model that has more than 10 layers is likely to be beneficial. As discussed above, a residual connection comprises making the output of an earlier layer available as input to a later layer, effectively creating a shortcut in a sequential network. Rather than being concatenated to the later activation, the earlier output is summed with the later activation, which assumes that both activations are the same size. If they are of different sizes, a linear transformation to reshape the earlier activation into the target shape can be used.
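A hedged Keras functional-API sketch of such a residual connection follows, including the linear 1 x 1 projection used when the two activations are of different sizes (the shapes and filter counts are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(32, 32, 64))
x = layers.Conv2D(128, 3, padding="same", activation="relu")(inputs)
x = layers.Conv2D(128, 3, padding="same")(x)

# The earlier output is added (not concatenated) to the later activation.
# Here the depths differ (64 vs. 128), so a 1 x 1 linear projection reshapes
# the earlier activation to the target shape before the element-wise sum.
shortcut = layers.Conv2D(128, 1, padding="same")(inputs)
outputs = layers.add([x, shortcut])

model = keras.Model(inputs, outputs)
model.summary()
```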
[00167] FIGURE 10 depicts one implementation of residual blocks and skip connections. The main idea of residual learning is that the residual mapping is much easier to learn than the original mapping. A residual network stacks a number of residual units to alleviate the degradation of training accuracy. Residual blocks make use of special additive skip connections to combat vanishing gradients in deep neural networks. At the beginning of a residual block, the data flow is separated into two streams: the first carries the unchanged input of the block, while the second applies weights and non-linearities. At the end of the block, the two streams are merged using an element-wise sum. The main advantage of such constructs is to allow the gradient to flow through the network more easily.
[00168] Benefiting from residual networks, deep convolutional neural networks (CNNs) can be easily trained, and improved accuracy has been achieved for image classification and object detection. Convolutional feed-forward networks connect the output of the $l$-th layer as input to the $(l+1)$-th layer, which gives rise to the following layer transition: $x_l = H_l(x_{l-1})$. Residual blocks add a skip connection that bypasses the non-linear transformations with an identity function: $x_l = H_l(x_{l-1}) + x_{l-1}$. An advantage of residual blocks is that the gradient can flow directly through the identity function from later layers to the earlier layers. However, the identity function and the output of $H_l$ are combined by summation, which may impede the information flow in the network.
[00169] WaveNet is a deep neural network for generating raw audio waveforms. WaveNet distinguishes itself from other convolutional networks since it is able to take relatively large 'visual fields' at low cost. Moreover, it is able to add conditioning of the signals locally and globally, which allows WaveNet to be used as a text-to-speech (TTS) engine with multiple voices, where the TTS gives the local conditioning and the particular voice the global conditioning.

[00170] The main building blocks of WaveNet are the causal dilated convolutions. As an extension of the causal dilated convolutions, WaveNet also allows stacks of these convolutions, as shown in FIGURE 11. To obtain the same receptive field with dilated convolutions in this figure, another dilation layer is required. The stacks are a repetition of the dilated convolutions, connecting the outputs of the dilated convolution layer to a single output. This enables WaveNet to get a large 'visual field' of one output node at a relatively low computational cost.

[00171] WaveNet adds a skip connection before the residual connection is made, which bypasses all the following residual blocks. Each of these skip connections is summed before passing them through a series of activation functions and convolutions. Intuitively, this is the sum of the information extracted in each layer.
[00172] Batch normalization

[00173] Batch normalization is a method for accelerating deep network training by making data standardization an integral part of the network architecture. Batch normalization can adaptively normalize data even as the mean and variance change over time during training. It works by internally maintaining an exponential moving average of the batch-wise mean and variance of the data seen during training. The main effect of batch normalization is that it helps with gradient propagation, much like residual connections, and thus allows for deep networks. Some very deep networks can only be trained if they include multiple Batch Normalization layers.
[00174] Batch normalization can be seen as yet another layer that can be inserted into the model architecture, just like the fully connected or convolutional layer. The BatchNormalization layer is typically used after a convolutional or densely connected layer. It can also be used before a convolutional or densely connected layer. Both implementations can be used by the disclosed technology and are shown in FIGURE 15. The BatchNormalization layer takes an axis argument, which specifies the feature axis that should be normalized. This argument defaults to -1, the last axis in the input tensor. This is the correct value when using Dense layers, Conv1D layers, RNN layers and Conv2D layers with data_format set to "channels_last". But in the case of using Conv2D layers with data_format set to "channels_first", the feature axis is axis 1; the axis argument in BatchNormalization can then be set to 1.
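For instance, the axis argument described above could be used as in the following Keras sketch (the layer sizes and input shapes are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# channels-last data: the feature axis is the last one (the default axis=-1)
model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3),
    layers.BatchNormalization(axis=-1),   # after the convolutional layer
    layers.Activation("relu"),
])

# channels-first data: the feature axis is axis 1
model_cf = keras.Sequential([
    keras.Input(shape=(3, 64, 64)),
    layers.Conv2D(32, 3, data_format="channels_first"),
    layers.BatchNormalization(axis=1),
    layers.Activation("relu"),
])
```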
[00175] Batch normalization provides a definition for feed-forwarding the input and computing the gradients with respect to the parameters and its own input via a backward pass. In practice, batch normalization layers are inserted after a convolutional or fully connected layer, but before the outputs are fed into an activation function. For convolutional layers, the different elements of the same feature map, i.e., the activations, at different locations are normalized in the same way in order to obey the convolutional property. Thus, all activations in a mini-batch are normalized over all locations, rather than per activation.
[00176] The internal covariate shift is the major reason why deep architectures have been notoriously slow to train. This stems from the fact that deep networks do not only have to learn a new representation at each layer, but also have to account for the change in their distribution.

[00177] The covariate shift in general is a known problem in the deep learning domain and frequently occurs in real-world problems. A common covariate shift problem is the difference in the distribution of the training and test set, which can lead to suboptimal generalization performance. This problem is usually handled with a whitening or standardization preprocessing step. However, especially the whitening operation is computationally expensive and thus impractical in an online setting, especially if the covariate shift occurs throughout different layers.

[00178] The internal covariate shift is the phenomenon where the distribution of network activations changes across layers due to the change in network parameters during training. Ideally, each layer should be transformed into a space where they have the same distribution but the functional relation stays the same. In order to avoid costly calculations of covariance matrices to decorrelate and whiten the data at every layer and step, the distribution of each input feature in each layer across each mini-batch is normalized to have zero mean and a standard deviation of one.
[00179] During the forward pass, the mini-batch mean and variance are calculated. With these mini-batch statistics, the data is normalized by subtracting the mean and dividing by the standard deviation. Finally, the data is scaled and shifted with the learned scale and shift parameters. The batch normalization forward pass $f_{BN}$ is depicted in FIGURE 12.
[00180] In FIGURE 12, $\mu_\beta$ is the batch mean and $\sigma_\beta^2$ is the batch variance, respectively. The learned scale and shift parameters are denoted by $\gamma$ and $\beta$, respectively. For clarity, the batch normalization procedure is described herein per activation and the corresponding indices are omitted.
[00181] Since normalization is a differentiable transform, the errors are propagated into these learned parameters and the network is thus able to restore its representational power by learning the identity transform. Conversely, by learning scale and shift parameters that are identical to the corresponding batch statistics, the batch normalization transform would have no effect on the network, if that were the optimal operation to perform. At test time, the batch mean and variance are replaced by the respective population statistics, since the input does not depend on other samples from a mini-batch. Another method is to keep running averages of the batch statistics during training and to use these to compute the network output at test time. At test time, the batch normalization transform can be expressed as illustrated in FIGURE 13. In FIGURE 13, $\mu_p$ and $\sigma_p^2$ denote the population mean and variance, rather than the batch statistics, respectively.
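A minimal NumPy sketch of the training-time and test-time behavior described above, using running averages of the batch statistics (the momentum value and shapes are illustrative assumptions):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running, momentum=0.9, eps=1e-5):
    """Training time: normalize with the mini-batch mean/variance, scale and
    shift with learned gamma and beta, and update the exponential moving
    averages of the batch statistics for use at test time."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def batchnorm_test(x, gamma, beta, running, eps=1e-5):
    """Test time: the batch statistics are replaced by the stored population
    estimates, so the output does not depend on other samples."""
    return gamma * (x - running["mean"]) / np.sqrt(running["var"] + eps) + beta

d = 4
gamma, beta = np.ones(d), np.zeros(d)
running = {"mean": np.zeros(d), "var": np.ones(d)}
batch = np.random.randn(32, d) * 3.0 + 1.0
y = batchnorm_train(batch, gamma, beta, running)
print(y.mean(axis=0), y.std(axis=0))   # approximately 0 and 1 per feature
```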
[00182] Since normalization is a differentiable operation, the backward pass can be computed as depicted in FIGURE 14. 1D Convolution
[00183] 1D convolutions extract local 1D patches or subsequences from sequences, as shown in FIGURE 16. A 1D convolution obtains each output timestep from a temporal patch in the input sequence. 1D convolution layers recognize local patterns in a sequence. Because the same input transformation is performed on every patch, a pattern learned at a certain position in the input sequences can later be recognized at a different position, making 1D convolution layers invariant to temporal translations. For instance, a 1D convolution layer processing sequences of bases using convolution windows of size 5 should be able to learn bases or base sequences of length 5 or less, and it should be able to recognize the base motifs in any context in an input sequence. A base-level 1D convolution is thus able to learn about base morphology.
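As a sketch under assumed shapes (the sequence length of 101 and the filter count of 32 are invented for illustration), a Keras 1D convolution over one-hot encoded bases with a window of size 5 could look like:

```python
from tensorflow import keras
from tensorflow.keras import layers

# One-hot encoded base sequence: 101 positions x 4 channels (A, C, G, T).
model = keras.Sequential([
    keras.Input(shape=(101, 4)),
    # A window of size 5 lets each filter learn base motifs of length <= 5
    # and recognize them at any position in the input sequence.
    layers.Conv1D(filters=32, kernel_size=5, padding="same", activation="relu"),
])
print(model.output_shape)   # (None, 101, 32)
```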
[00184] FIGURE 17 illustrates how global average pooling (GAP) works. Global average pooling can be used to replace fully connected (FC) layers for classification by taking the spatial average of the features in the last layer for scoring. This reduces the training load and bypasses overfitting issues. Global average pooling applies a structural prior to the model and is equivalent to a linear transformation with predefined weights. Global average pooling reduces the number of parameters and eliminates the fully connected layer. Fully connected layers are typically the most parameter- and connection-intensive layers, and global average pooling provides a much lower-cost approach to achieving similar results. The main idea of global average pooling is to generate the average value of each last-layer feature map as the confidence factor for scoring, feeding directly into the softmax layer.

[00185] Global average pooling has three benefits: (1) there are no extra parameters in global average pooling layers, thus overfitting is avoided at global average pooling layers; (2) since the output of global average pooling is the average of the whole feature map, global average pooling will be more robust to spatial translations; and (3) because of the huge number of parameters in fully connected layers, which usually take over 50% of all the parameters of the whole network, replacing them with global average pooling layers can significantly reduce the size of the model.

[00186] Global average pooling makes sense, since stronger features in the last layer are expected to have a higher average value. In some implementations, global average pooling can be used as a proxy for the classification score. The feature maps under global average pooling can be interpreted as confidence maps, and they force correspondence between the feature maps and the categories. Global average pooling can be particularly effective if the last-layer features are at a sufficient abstraction for direct classification; however, global average pooling alone is not enough if multi-level features should be combined into groups, like part models, which is best performed by adding a simple fully connected layer or another classifier after the global average pooling.
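A hedged Keras sketch of replacing the fully connected classification layers with global average pooling feeding softmax (the layer sizes and the ten-class setup are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# The last convolution emits one feature map per class; global average
# pooling reduces each map to a single confidence value (adding no
# parameters), which feeds the softmax directly instead of FC layers.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(10, 3, padding="same", activation="relu"),  # one map per class
    layers.GlobalAveragePooling2D(),
    layers.Activation("softmax"),
])
model.summary()
```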
[00187] [00187] All literature and similar material cited in this application, including, but not limited to, patents, patent applications, articles, books, treatises and web pages, regardless of the format of such literature and similar materials, are expressly incorporated by reference in their entirety. In the event that one or more of the incorporated literature, patents and similar materials differs from or contradicts this application, including, but not limited to, defined terms, term usage, described techniques or the like, this application prevails.
[00188] [00188] As used in this document, the following terms have the meanings indicated.
[00189] [00189] A base refers to a nucleotide or nucleotide base, A (adenine), C (cytosine), T (thymine) or G (guanine).
[00190] [00190] This application uses the terms "protein" and "translated sequence" interchangeably.
[00191] [00191] This application uses the terms "codon" and "base triplet" interchangeably.
[00192] [00192] This application uses the terms "amino acid" and "translated unit" interchangeably.
[00193] [00193] This application uses the phrases "classifier of pathogenicity of variants", "classifier based on convolutional neural network for classification of variants" and "classifier based on deep convolutional neural network for classification of variants" interchangeably.
[00194] [00194] The term "chromosome" refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional, internationally recognized human genome chromosome numbering system is employed in this document.
[00195] [00195] The term "site" refers to a unique position (for example, chromosome ID, chromosome position and orientation) in a reference genome. In some implementations, a site can be a residue, a sequence tag, or the position of a segment in a sequence. The term "locus" can be used to refer to the specific location of a nucleic acid sequence or polymorphism on a reference chromosome.
[00196] [00196] The term "sample" in this document refers to a sample, typically derived from a biological fluid, cell, tissue, organ or organism, containing a nucleic acid or a mixture of nucleic acids containing at least one nucleic acid sequence that is to be sequenced and / or phased. Such samples include, but are not limited to, sputum / oral fluid, amniotic fluid, blood, a blood fraction, biopsy samples (for example, surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, explanted tissue, organ culture and any other tissue or cell preparation, or a fraction or derivative thereof or isolated therefrom. Although the sample is often taken from a human subject (for example, a patient), samples can be collected from any organism with chromosomes, including, but not limited to, dogs, cats, horses, goats, sheep, cattle, pigs, etc. The sample can be used directly as obtained from the biological source or after pre-treatment to modify the character of the sample. For example, such pre-treatment can include preparing plasma from blood, diluting viscous fluids, and so on. Pre-treatment methods can also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, addition of reagents, lysis, etc.
[00197] [00197] The term "sequence" includes or represents a chain of nucleotides coupled to each other. The nucleotides can be based on DNA or RNA. It should be understood that a sequence can include multiple sub-sequences. For example, a single sequence (for example, from a PCR amplicon) can have 350 nucleotides. The sample reading can include multiple sub-sequences within these 350 nucleotides. For example, the sample reading can include first and second flanking sub-sequences having, for example, 20-50 nucleotides. The first and second flanking sub-sequences can be located on either side of a repetitive segment having a corresponding sub-sequence (for example, 40-100 nucleotides). Each of the flanking sub-sequences can include (or include portions of) a primer sub-sequence (for example, 10-30 nucleotides). For ease of reading, the term "sub-sequence" will be referred to as "sequence", but it is understood that two sequences are not necessarily separate from each other on a common strand. To differentiate the various sequences described in this document, the sequences can be given different labels (for example, target sequence, primer sequence, flanking sequence, reference sequence and the like). Other terms, such as "allele", can be given different labels to differentiate similar objects.
[00198] [00198] The term "paired-end sequencing" refers to sequencing methods that sequence both ends of a target fragment. Paired-end sequencing can facilitate the detection of genomic rearrangements and repetitive segments, as well as gene fusions and new transcripts. The methodology for paired-end sequencing is described in PCT Publication No. WO07010252, PCT Application Serial No. PCT / GB2007 / 003798 and US Patent Application Publication No. US 2009/0088327, each of which is incorporated by reference in this document. In an example, a series of operations can be performed as follows: (a) generate clusters of nucleic acids; (b) linearize the nucleic acids; (c) hybridize a first sequencing primer and perform repeated cycles of extension, scanning and deblocking, as set out above; (d) "invert" the target nucleic acids on the surface of the flow cell by synthesizing a complementary copy; (e) linearize the resynthesized strand; and (f) hybridize a second sequencing primer and perform repeated cycles of extension, scanning and deblocking, as set out above. The inversion operation can be performed by delivering reagents as set out above for a single cycle of bridge amplification.
[00199] [00199] The term "reference genome" or "reference sequence" refers to any specific known genome sequence, partial or complete, of any organism that can be used to reference identified sequences from a subject. For example, a reference genome used for human subjects, as well as for many other organisms, is found at the National Center for Biotechnology Information, at ncbi.nlm.nih.gov. A "genome" refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences. A genome includes both genes and non-coding DNA sequences. The reference sequence can be larger than the readings aligned to it. For example, it can be at least about 100 times larger, or at least about 1,000 times larger, or at least about 10,000 times larger, or at least about 10⁵ times larger, or at least about 10⁶ times larger, or at least about 10⁷ times larger. In one example, the reference genome sequence is that of a complete human genome. In another example, the reference genome sequence is limited to a specific human chromosome, such as chromosome 13. In some implementations, a reference chromosome is a chromosomal sequence from version hg19 of the human genome. Such sequences can be referred to as chromosomal reference sequences, although the term reference genome is intended to cover such sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions (such as strands), etc., of any species. In several implementations, the reference genome is a consensus sequence or other combination derived from several individuals. However, in certain applications, the reference sequence can be obtained from a particular individual.
[00200] [00200] The term "reading" refers to a collection of sequence data that describes a fragment of a nucleotide sample or reference. The term "reading" can refer to a sample reading and / or a reference reading. Usually, although not necessarily, a reading represents a short sequence of contiguous base pairs in the sample or reference. The reading can be represented symbolically by the base-pair sequence (in ATCG) of the sample or reference fragment. It can be stored in a memory device and processed as appropriate to determine whether the reading matches a reference sequence or meets other criteria. A reading can be obtained directly from a sequencing apparatus or indirectly from stored sequence information for the sample. In some cases, a reading is a DNA sequence of sufficient length (for example, at least about 25 bp) that can be used to identify a larger sequence or region, for example, that can be aligned and specifically assigned to a chromosome or genomic region or gene.
[00201] [00201] State-of-the-art sequencing methods include, for example, sequencing-by-synthesis technology (Illumina), pyrosequencing (454), ion semiconductor technology (Ion Torrent sequencing), real-time single molecule sequencing (Pacific Biosciences) and sequencing by ligation (SOLiD sequencing). Depending on the sequencing method, the length of each reading can vary from about 30 bp to more than 10,000 bp. For example, the sequencing method using the SOLiD sequencer generates nucleic acid readings of about 50 bp. For another example, Ion Torrent sequencing generates nucleic acid readings of up to 400 bp, and 454 pyrosequencing generates nucleic acid readings of about 700 bp. For yet another example, real-time single molecule sequencing methods can generate readings of 10,000 to 15,000 bp. Therefore, in certain implementations, the nucleic acid sequence readings are 30-100 bp, 50-200 bp or 50-400 bp in length.
[00202] [00202] The terms "sample reading", "sample sequence" or "sample fragment" refer to sequence data for a genomic sequence of interest in a sample. For example, the sample reading comprises sequence data from a PCR amplicon having forward and reverse primer sequences. The sequence data can be obtained from any selected sequencing methodology. The sample reading can be, for example, from a sequencing-by-synthesis (SBS) reaction, a sequencing-by-ligation reaction or any other suitable sequencing methodology for which it is desired to determine the length and / or identity of a repetitive element. The sample reading can be a consensus sequence (for example, averaged or weighted) derived from several sample readings. In certain implementations, providing a reference sequence comprises identifying a locus of interest based on the primer sequence of the PCR amplicon.
[00203] [00203] The term "raw fragment" refers to sequence data for a portion of a genomic sequence of interest that at least partially overlaps a designated position or a secondary position of interest within a sample reading or sample fragment. Non-limiting examples of raw fragments include a duplex stitched fragment, a simplex stitched fragment, a duplex unstitched fragment and a simplex unstitched fragment. The term "raw" is used to indicate that the raw fragment includes sequence data having some relation to the sequence data in a sample reading, regardless of whether the raw fragment exhibits a support variant that matches and authenticates or confirms a potential variant in a sample reading. The term "raw fragment" does not indicate that the fragment necessarily includes a support variant that validates a variant call in a sample reading. For example, when a sample reading is determined by a variant call application to exhibit a first variant, the variant call application can determine that one or more raw fragments lack a corresponding type of "support" variant that could otherwise be expected to occur, given the variant in the sample reading.
[00204] [00204] The terms "mapping", "aligned", "alignment" or "aligning" refer to the process of comparing a reading or tag to a reference sequence and thus determining whether the reference sequence contains the reading sequence. If the reference sequence contains the reading, the reading can be mapped to the reference sequence or, in certain implementations, to a specific location in the reference sequence. In some cases, an alignment simply indicates whether or not a reading is a member of a specific reference sequence (that is, whether the reading is present or absent in the reference sequence). For example, aligning a reading with the reference sequence for human chromosome 13 will indicate whether the reading is present in the reference sequence for chromosome 13. A tool that provides this information can be called a set membership tester. In some cases, an alignment additionally indicates a location in the reference sequence where the reading or tag maps. For example, if the reference sequence is the complete human genome sequence, an alignment may indicate that a reading is present on chromosome 13 and may also indicate that the reading is on a specific strand and / or site of chromosome 13.
[00205] [00205] The term "indel" refers to the insertion and / or deletion of bases in the DNA of an organism. A micro-indel represents an indel that results in a net change of 1 to 50 nucleotides. In the coding regions of the genome, unless the length of an indel is a multiple of 3, it will produce a frameshift mutation. Indels can be contrasted with point mutations. An indel inserts or deletes nucleotides from a sequence, while a point mutation is a form of substitution that replaces one of the nucleotides without changing the overall number in the DNA. Indels can also be contrasted with a Tandem Base Mutation (TBM), which can be defined as a substitution at adjacent nucleotides (mainly substitutions at two adjacent nucleotides, but substitutions at three adjacent nucleotides have been observed).
[00206] [00206] The term "variant" refers to a nucleic acid sequence that is different from a nucleic acid reference. Typical nucleic acid sequence variants include, without limitation, single nucleotide polymorphisms (SNP), short deletion and insertion polymorphisms (Indel), copy number variation (CNV), microsatellite markers or short tandem repeats, and structural variation. Somatic variant calling is the effort to identify variants present at low frequency in a DNA sample. Somatic variant calling is of interest in the context of cancer treatment. Cancer is caused by an accumulation of mutations in the DNA. A DNA sample from a tumor is usually heterogeneous, including some normal cells, some cells at an early stage of cancer progression (with fewer mutations) and some cells at a late stage (with more mutations). Because of this heterogeneity, when sequencing a tumor (for example, from an FFPE sample), somatic mutations usually appear at low frequency. For example, an SNV may be seen in only 10% of the readings covering a given base. A variant that is to be classified as somatic or germline by the variant classifier is also referred to in this document as the "variant under test".
[00207] [00207] The term "noise" refers to an incorrect variant call resulting from one or more errors in the sequencing process and / or in the variant call application.
[00208] [00208] The term "variant frequency" represents the relative frequency of an allele (variant of a gene) in a specific locus of a population, expressed as a fraction or percentage. For example, the fraction or percentage can be the fraction of all the chromosomes in the population that carry that allele. As an example, the frequency of the sample variant represents the relative frequency of an allele / variant at a given locus / position along a genomic sequence of interest on a "population" corresponding to the number of readings and / or samples obtained for the genomic sequence of interest of an individual. As another example, a reference variant frequency represents the relative frequency of an allele / variant at a specific locus / position over one or more reference genomic sequences where the "population" corresponds to the number of readings and / or samples obtained for the one or more reference genomic sequences from a population of normal individuals.
[00209] [00209] The term "variant allele frequency (VAF)" refers to the percentage of observed sequenced readings matching the variant divided by the overall coverage at the target position. VAF is a measure of the proportion of sequenced readings carrying the variant.
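Expressed as a computation, with hypothetical read counts:

def variant_allele_frequency(variant_readings, total_coverage):
    # VAF = readings supporting the variant / overall coverage at the position.
    return variant_readings / total_coverage

# For example, a somatic SNV seen in 10 of 100 readings covering a base
# has a VAF of 0.10 (10%).
assert variant_allele_frequency(10, 100) == 0.10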
[00210] [00210] The terms "position", "designated position" and "locus" refer to a location or coordinate of one or more nucleotides within a nucleotide sequence. The terms "position", "designated position" and "locus" also refer to a location or coordinate of one or more base pairs in a nucleotide sequence.
[00211] [00211] The term "haplotype" refers to a combination of alleles at adjacent locations on a chromosome that are inherited together. A haplotype can be a locus, several loci, or an entire chromosome, depending on the number of recombination events that have occurred between a given set of loci, if any.
[00212] [00212] The term "threshold" in this document refers to a numeric or non-numeric value that is used as a cutoff point to characterize a sample, a nucleic acid or a portion thereof (for example, a reading). A threshold can be varied based on empirical analysis. The threshold can be compared to a measured or calculated value to determine whether the source giving rise to that value suggests that it should be classified in a particular way. Threshold values can be identified empirically or analytically. The choice of a threshold depends on the level of confidence that the user wants to have in making the classification. The threshold can be chosen for a specific purpose (for example, to balance sensitivity and selectivity). As used in this document, the term "threshold" indicates a point at which a course of analysis may be changed and / or a point at which an action may be triggered.
[00213] [00213] In some implementations, a metric or score based on sequencing data can be compared to the threshold. As used in this document, the terms "metric" or "score" can include values or results that were determined from the sequencing data, or can include functions based on the values or results that were determined from the sequencing data. Like a threshold, the metric or score can be adaptive to the circumstances. For example, the metric or score can be a normalized value. As an example of a score or metric, one or more implementations can use count scores when analyzing the data. A count score can be based on the number of sample readings. The sample readings may have gone through one or more filtering stages, so that the sample readings have at least one common characteristic or quality. For example, each of the sample readings used to determine a count score may have been aligned with a reference sequence or may be assigned as a potential allele. The number of sample readings with a common characteristic can be counted to determine a reading count. Count scores can be based on the reading count. In some implementations, the count score can be a value equal to the reading count. In other implementations, the count score can be based on the reading count and other information. For example, a count score can be based on the reading count for a specific allele of a genetic locus and a total number of readings for the genetic locus. In some implementations, the count score can be based on the reading count and on data previously obtained for the genetic locus. In some implementations, count scores can be normalized scores between predetermined values. The count score can also be a function of the reading counts of other loci in a sample, or a function of the reading counts of other samples that were run simultaneously with the sample of interest. For example, the count score can be a function of the reading count for a specific allele and the reading counts of other loci in the sample and / or the reading counts of other samples. As an example, the reading counts of other loci and / or the reading counts of other samples can be used to normalize the count score for the specific allele.
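One simple instance of such a normalized count score, given here as a hypothetical illustration rather than a prescribed formula, divides the reading count for a specific allele by the total number of readings for the genetic locus:

def count_score(allele_reading_count, locus_reading_count):
    # A normalized count score between 0 and 1: the reading count for a
    # specific allele divided by the total reading count for the locus.
    if locus_reading_count == 0:
        return 0.0
    return allele_reading_count / locus_reading_count

# 45 of 60 filtered readings at a locus support allele A -> score 0.75.
assert count_score(45, 60) == 0.75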
[00214] [00214] The terms "coverage" or "fragment coverage" refer to a count or other measure of the number of sample readings for the same fragment of a sequence. A reading count can represent a count of the number of readings covering a corresponding fragment. Alternatively, the coverage can be determined by multiplying the reading count by a designated factor that is based on historical knowledge, sample knowledge, locus knowledge, etc.
[00215] [00215] The term "reading depth" (conventionally a number followed by "x") refers to the number of sequenced readings with overlapping alignment at the target position. This is usually expressed as an average, or as a percentage exceeding a cutoff, over a set of intervals (such as exons, genes or panels). For example, a clinical report can state that the average coverage of the panel is 1,105x, with 98% of the targeted bases covered > 100x.
[00216] [00216] The terms "base call quality score" or "Q score" refer to a PHRED-scaled probability, ranging from 0-50, that is inversely related to the probability that a single sequenced base call is in error. For example, a base call with a Q score of 20 has an estimated probability of error of 1%. Any base call with Q < 20 should be considered low quality, and any variant identified where a substantial proportion of the sequenced readings supporting the variant are of low quality should be considered potentially false positive.
[00217] [00217] The terms "variant readings" or "variant reading number" refer to the number of sequenced readings that support the presence of the variant.
[00218] [00218] The implementations set out in this document may be applicable to the analysis of nucleic acid sequences to identify sequence variations. The implementations can be used to analyze potential variants / alleles of a genetic position / locus and to determine a genotype of the genetic locus or, in other words, to provide a genotype call for the locus. For example, nucleic acid sequences can be analyzed according to the methods and systems described in US Patent Application Publication No. 2016/0085910 and US Patent Application Publication No. 2013/0296175, the entire subject matter of each of which is expressly incorporated by reference in this document in its entirety.
[00219] [00219] In an implementation, a sequencing process includes receiving a sample that includes or is suspected of including nucleic acids, such as DNA. The sample can be from a known or unknown source, such as an animal (for example, human), plant, bacteria or fungus. The sample can be collected directly from the source. For example, blood or saliva can be collected directly from an individual. Alternatively, the sample may not be obtained directly from the source. Then, one or more processors direct the system to prepare the sample for sequencing. The preparation may include removing foreign material and / or isolating certain material (for example, DNA). The biological sample can be prepared to include characteristics for a particular assay. For example, the biological sample can be prepared for synthesis sequencing (SBS). In certain implementations, the preparation may include amplification of certain regions of a genome. For example, the preparation can include amplifying predetermined genetic loci that are known to include STRs and / or SNPs. Genetic loci can be amplified using predetermined primers.
[00220] [00220] Then, one or more processors direct the system to sequence the sample. Sequencing can be performed using a variety of known sequencing protocols. In specific implementations, sequencing includes SBS. In SBS, a plurality of fluorescently labeled nucleotides are used to sequence a plurality of clusters of amplified DNA (possibly millions of clusters) present on the surface of an optical substrate (for example, a surface that at least partially defines a channel in a flow cell). The flow cells can contain nucleic acid samples for sequencing, where the flow cells are placed within appropriate flow cell holders.
[00221] [00221] The nucleic acids can be prepared to comprise a known primer sequence adjacent to an unknown target sequence. To initiate the first SBS sequencing cycle, one or more differently labeled nucleotides and DNA polymerase, etc., can be flowed to / through the flow cell by a fluid flow subsystem. A single type of nucleotide can be added at a time, or the nucleotides used in the sequencing procedure can be specially designed to have a reversible termination property, thus allowing each cycle of the sequencing reaction to occur simultaneously in the presence of several types of labeled nucleotides (for example, A, C, T, G). The nucleotides can include detectable marker moieties, such as fluorophores. Where the four nucleotides are mixed together, the polymerase is able to select the correct base to incorporate, and each sequence is extended by a single base. Unincorporated nucleotides can be removed by washing, with a washing solution flowing through the flow cell. One or more lasers can excite the nucleic acids and induce fluorescence. The fluorescence emitted from the nucleic acids is based on the fluorophores of the incorporated base, and different fluorophores can emit different wavelengths of light. A deblocking reagent can be added to the flow cell to remove the reversible terminator groups from the DNA strands that have been extended and detected. The deblocking reagent can then be washed away by flowing a washing solution through the flow cell. The flow cell is then ready for a further cycle of sequencing, starting with the introduction of a labeled nucleotide as set out above. The fluidic and detection operations can be repeated several times to complete a sequencing run. Examples of sequencing methods are described, for example, in Bentley et al., Nature 456: 53-59 (2008); International Publication No. WO 04/018497; US Patent No. 7,057,026; International Publication No. WO 91/06678; International Publication No. WO 07/123744; US Patent No. 7,329,492; US Patent No. 7,211,414; US Patent No. 7,315,019; US Patent No. 7,405,281 and US Patent Application Publication No. 2008/0108082, each of which is incorporated herein by reference.
[00222] [00222] In some implementations, nucleic acids can be attached to a surface and amplified before or during sequencing. For example, amplification can be performed using bridge amplification to form nucleic acid clusters on a surface. Useful bridge amplification methods are described, for example, in US Patent No. 5,641,658; US Patent Application Publication No. 2002/0055100; US Patent No. 7,115,400; US Patent Application Publication No. 2004/0096853; US Patent Application Publication No. 2004/0002090; US Patent Application Publication No. 2007/0128624; and US Patent Application Publication No. 2008/0009420, each of which is incorporated herein by reference in its entirety. Another useful method for amplifying nucleic acids on a surface is rolling circle amplification (RCA), for example, as described in Lizardi et al., Nat. Genet. 19: 225-232 (1998) and US Patent Application Publication No. 2007/0099208 A1, each of which is incorporated herein by reference.
[00223] [00223] An example SBS protocol exploits modified nucleotides with removable 3' blocks, for example, as described in International Publication No. WO 04/018497, US Patent Application Publication No. 2007/0166705 A1 and US Patent No. 7,057,026, each of which is incorporated into this document by reference. For example, repeated cycles of SBS reagents can be delivered to a flow cell with target nucleic acids attached to it, for example, as a result of the bridge amplification protocol. The nucleic acid clusters can be converted to single-stranded form using a linearization solution. The linearization solution can contain, for example, a restriction endonuclease capable of cleaving one strand of each cluster. Other cleavage methods can be used as an alternative to restriction enzymes or nicking enzymes, including, but not limited to, chemical cleavage (for example, cleavage of a diol bond with periodate), cleavage of abasic sites by cleavage with endonuclease (for example, "USER", as supplied by NEB, Ipswich, Mass., USA, part number M5505S), by exposure to heat or alkali, cleavage of ribonucleotides incorporated into amplification products otherwise comprised of deoxyribonucleotides, photochemical cleavage or cleavage of a peptide linker. After the linearization operation, a sequencing primer can be delivered to the flow cell under conditions for hybridization of the sequencing primer to the target nucleic acids to be sequenced.
[00224] [00224] A flow cell can then be contacted with an SBS extension reagent having modified nucleotides with removable 3' blocks and fluorescent markers, under conditions to extend a primer hybridized to each target nucleic acid by a single nucleotide addition. Only a single nucleotide is added to each primer, because once the modified nucleotide has been incorporated into the growing polynucleotide chain complementary to the region of the template being sequenced, there is no free 3'-OH group available to direct further sequence extension and, therefore, the polymerase cannot add more nucleotides. The SBS extension reagent can be removed and replaced with a scanning reagent containing components that protect the sample under excitation radiation. Example components for scanning reagents are described in US Patent Application Publication No. 2008/0280773 A1 and US Patent Application No. 13 / 018,255, each of which is incorporated herein by reference. The extended nucleic acids can then be detected by fluorescence in the presence of the scanning reagent. Once the fluorescence has been detected, the 3' block can be removed using a deblocking reagent suitable for the blocking group used. Examples of deblocking reagents that are useful for the respective blocking groups are described in WO 04/018497, US 2007/0166705 A1 and US Patent No. 7,057,026, each of which is incorporated herein by reference. The deblocking reagent can be washed away, leaving the target nucleic acids hybridized to extended primers with 3'-OH groups that are now competent for the addition of a further nucleotide. Consequently, the cycles of adding extension reagent, scanning reagent and deblocking reagent, with optional washes between one or more of the operations, can be repeated until a desired sequence is obtained. The above cycles can be performed using a single extension reagent delivery operation per cycle when each of the modified nucleotides has attached to it a different marker known to correspond to the particular base. The different markers facilitate discrimination between the nucleotides added during each incorporation operation. Alternatively, each cycle can include separate operations of extension reagent delivery followed by separate operations of scanning reagent delivery and detection, in which case two or more of the nucleotides can have the same marker and can be distinguished based on the known order of delivery.
[00225] [00225] Although the sequencing operation has been discussed above with respect to a specific SBS protocol, it will be understood that other protocols for sequencing any of a variety of other molecular analyses can be performed as desired.
[00226] [00226] Then, one or more processors of the system receive the sequencing data for subsequent analysis. The sequencing data can be formatted in several ways, such as a .BAM file. The sequencing data can include, for example, a number of sample readings. The sequencing data can include a plurality of sample readings having corresponding sample sequences of nucleotides. Although only one sample reading is discussed, it should be understood that the sequencing data can include, for example, hundreds, thousands, hundreds of thousands or millions of sample readings. Different sample readings can have different numbers of nucleotides. For example, a sample reading can vary between 10 nucleotides and about 500 nucleotides or more. The sample readings can span the entire genome of the source(s). As an example, the sample readings can be directed toward predetermined genetic loci of interest.
[00227] [00227] Each sample reading can include a sequence of nucleotides, which can be referred to as a sample sequence, a sample fragment or a target sequence. The sample sequence can include, for example, primer sequences, flanking sequences and a target sequence. The number of nucleotides within the sample sequence can include 30, 40, 50, 60, 70, 80, 90, 100 or more. In some implementations, one or more of the sample readings (or sample sequences) include at least 150 nucleotides, 200 nucleotides, 300 nucleotides, 400 nucleotides, 500 nucleotides or more. In some implementations, the sample readings can include more than 1,000 nucleotides, 2,000 nucleotides or more. The sample readings (or sample sequences) can include primer sequences at one or both ends.
[00228] [00228] Then, the one or more processors analyze the sequencing data to obtain potential variant calls and a sample variant frequency for the sample variant calls. The operation can also be referred to as a variant call application or variant caller. Thus, the variant caller identifies or detects variants, and the variant classifier classifies the detected variants as somatic or germline. Alternative variant callers can be used in accordance with the implementations in this document, with different variant callers being used based on the type of sequencing operation being performed, on the characteristics of the sample that are of interest, and the like. A non-limiting example of a variant call application is the Pisces™ application by Illumina Inc. (San Diego, CA), hosted at https://github.com/Illumina/Pisces and described in Dunn, Tamsen & Berry, Gwenn & Emig-Agius, Dorothea & Jiang, Yu & Iyer, Anita & Udar, Nitin & Strömberg, Michael. (2017). Pisces: An Accurate and Versatile Single Sample Somatic and Germline Variant Caller. 595-595. 10.1145/3107411.3108203.
[00229] [00229] This variant call application can comprise four modules executed sequentially:
[00230] [00230] (1) Pisces Read Stitcher: reduces noise by stitching paired readings in a BAM (reading one and reading two of the same molecule) into consensus readings. The output is a stitched BAM.
[00231] [00231] (2) Pisces Variant Caller: calls small SNVs, insertions and deletions. Pisces includes a variant collapsing algorithm to coalesce variants broken up by reading boundaries, basic filtering algorithms and a simple Poisson-based variant confidence scoring algorithm. The output is a VCF.
[00232] [00232] (3) Pisces Variant Quality Recalibrator (VQR): if the variant calls predominantly follow a pattern associated with thermal damage or FFPE deamination, the VQR step lowers the variant Q score of the suspect variant calls. The output is an adjusted VCF.
[00233] [00233] (4) Pisces Variant Phaser (Scylla): uses a read-backed greedy clustering method to assemble small variants into complex alleles from clonal subpopulations. This allows a more precise determination of functional consequences by downstream tools. The output is an adjusted VCF.
[00234] [00234] Additionally or alternatively, the operation may use the Strelka™ variant calling application by Illumina Inc., hosted at https://github.com/Illumina/strelka and described in the article Saunders, Christopher T. & Wong, Wendy & Swamy, Sajani & Becq, Jennifer & Murray, Lisa J. & Cheetham, Keira. (2012). Strelka: Accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics. 28. 1811-1817. 10.1093/bioinformatics/bts271.
[00235] [00235] This variant annotation / generation tool can apply different algorithmic techniques, such as those disclosed in Nirvana:
[00236] [00236] a. Identification of all overlapping transcripts with an interval array: for functional annotation, we can identify all the transcripts overlapping a variant, and an interval tree can be used for this. However, since the set of intervals is static, we were able to optimize it further into an interval array. An interval tree returns all overlapping transcripts in O(min(n, k lg n)) time, where n is the number of intervals in the tree and k is the number of overlapping intervals. In practice, since k is really small compared to n for most variants, the effective running time on the interval tree would be O(k lg n). We improved this to O(lg n + k) by creating an interval array in which all intervals are stored in a sorted array, so that we only need to find the first overlapping interval and then enumerate through the remaining (k-1), as illustrated in the sketch below.
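For illustration, a minimal sketch of this O(lg n + k) lookup, under the simplifying assumption that the intervals are sorted by start position and non-nested, so that the k overlapping intervals form one contiguous run; the production data structure may handle the general case differently.

import bisect

def overlapping_intervals(intervals, q_start, q_end):
    # intervals: list of (start, end) pairs sorted by start and assumed
    # non-nested, so the ends are sorted as well and the overlaps form a
    # single contiguous run of the array.
    starts = [s for s, _ in intervals]
    ends = [e for _, e in intervals]
    lo = bisect.bisect_left(ends, q_start)    # first interval ending at or after q_start
    hi = bisect.bisect_right(starts, q_end)   # first interval starting after q_end
    return intervals[lo:hi]                   # the k overlaps: O(lg n + k) overall

transcripts = [(100, 250), (200, 400), (500, 650)]
print(overlapping_intervals(transcripts, 240, 520))   # all three overlap the query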
[00237] [00237] b. CNVs / SVs (Yu): annotations can be provided for copy number variations and structural variants. Similarly to the annotation of small variants, transcripts overlapping the SV, as well as structural variants previously reported in public databases, can be annotated. Unlike small variants, not all overlapping transcripts need to be annotated, since too many transcripts will overlap a large SV. Instead, all overlapping transcripts belonging to a partially overlapping gene can be annotated. Specifically, for these transcripts, the affected introns, exons and the consequences caused by the structural variants can be reported. An option is available to allow all overlapping transcripts to be output, but the basic information for these transcripts can be reported, such as the gene symbol and a flag indicating whether it is a canonically overlapping or partially overlapping transcript. For each SV / CNV, it is also interesting to know whether these variants have been studied and their frequencies in different populations. Therefore, we report overlapping SVs in external databases, such as 1000 Genomes, DGV and ClinGen. To avoid using an arbitrary cutoff to determine which SV is overlapping, instead all the overlapping transcripts can be used and the reciprocal overlap can be calculated, that is, the length of the overlap divided by the minimum of the lengths of the two SVs (see the sketch below).
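The reciprocal overlap computation is simple enough to state directly; the sketch below uses half-open genomic coordinates, which is an assumption of this illustration:

def reciprocal_overlap(sv_a, sv_b):
    # Length of the overlap divided by the minimum of the two SV lengths;
    # returns 0.0 when the structural variants do not overlap at all.
    (a_start, a_end), (b_start, b_end) = sv_a, sv_b
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return overlap / min(a_end - a_start, b_end - b_start)

# A 1 kb deletion sharing 800 bp with a 2 kb reference SV: 800 / 1000 = 0.8.
assert reciprocal_overlap((1000, 2000), (1200, 3200)) == 0.8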
[00238] [00238] c. Reporting supplementary annotations: supplementary annotations are of two types: small variants and structural variants (SVs). SVs can be modeled as intervals, and the interval array discussed above is used to identify overlapping SVs. Small variants are modeled as points and matched by position and (optionally) allele. As such, they are searched for using a binary-search-like algorithm. Since the supplementary annotation database can be quite large, a much smaller index is created to map chromosome positions to the file locations where the supplementary annotation resides. The index is a sorted array of objects (composed of chromosome position and file location) that can be binary searched using the position. To keep the index size small, several positions (up to a certain maximum count) are compressed into one object that stores the value of the first position and only deltas for the subsequent positions. Since we use binary search, the running time is O(lg n), where n is the number of items in the database. A sketch of this index follows.
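A sketch of such a delta-compressed, binary-searchable index is given below; the block layout and field names are hypothetical, chosen only to mirror the description above.

import bisect

class SupplementaryIndex:
    # Hypothetical delta-compressed index from chromosome position to the
    # file location of a supplementary annotation. Each block stores its
    # first absolute position plus small deltas for subsequent positions,
    # which keeps the in-memory index small; locating a block is a binary
    # search over the block start positions, i.e., O(lg n).

    def __init__(self, blocks):
        # blocks: list of (first_position, deltas, file_locations), where
        # len(file_locations) == len(deltas) + 1.
        self.blocks = blocks
        self.block_starts = [first for first, _, _ in blocks]

    def find(self, position):
        i = bisect.bisect_right(self.block_starts, position) - 1
        if i < 0:
            return None
        first, deltas, locations = self.blocks[i]
        pos = first
        for delta, location in zip([0] + list(deltas), locations):
            pos += delta
            if pos == position:
                return location   # file offset where the annotation resides
        return None

index = SupplementaryIndex([(1000, [5, 12], [0, 113, 298])])
assert index.find(1017) == 298    # positions 1000, 1005, 1017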
[00239] [00239] d. Files without the VEP cache and the transcript database:
[00240] [00240] The supplementary database (SAdb) files and the transcript cache (cache) are serialized dumps of data objects, such as transcripts and supplementary annotations. We use the Ensembl VEP cache as our data source for the cache. To create the cache, all transcripts are inserted into an interval array and the final state of the array is stored in the cache files. Thus, during annotation, we only need to load a pre-computed interval array and perform searches on it. Since the cache is loaded into memory and searching is very fast (as described above), finding overlapping transcripts is extremely fast in Nirvana (profiled at less than 1% of the total running time).
[00241] [00241] f. Supplementary database: the SAdb data sources are listed in the supplementary material. The SAdb for small variants is produced by a k-way merge of all the data sources, so that each object in the database (identified by name and reference position) holds all of its relevant supplementary annotations. The problems encountered while parsing the data source files have been documented in detail on Nirvana's home page. To limit memory usage, only the SA index is loaded into memory. This index allows a quick lookup of the file location for a supplementary annotation. However, since the data has to be fetched from disk, adding supplementary annotations has been identified as Nirvana's biggest bottleneck (profiled at ~30% of the total running time). A k-way merge over pre-sorted sources is sketched below.
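A minimal sketch of such a k-way merge over pre-sorted annotation streams, using Python's heap-based merge; the tuple layout is a hypothetical stand-in for the real database objects:

import heapq
import itertools

def merge_annotation_sources(*sources):
    # Each source yields (reference_position, name, annotation) tuples in
    # sorted order. heapq.merge interleaves the k streams in overall sorted
    # order; consecutive entries for the same variant are then grouped so
    # that each output object keeps all of its supplementary annotations.
    merged = heapq.merge(*sources)
    for key, group in itertools.groupby(merged, key=lambda t: (t[0], t[1])):
        yield key, [annotation for _, _, annotation in group]

dbsnp = iter([(101, "A>C", "rs1"), (220, "G>T", "rs7")])
clinvar = iter([(101, "A>C", "pathogenic")])
print(list(merge_annotation_sources(dbsnp, clinvar)))
# [((101, 'A>C'), ['pathogenic', 'rs1']), ((220, 'G>T'), ['rs7'])]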
[00242] [00242] g. Sequence and Consequence Ontology: Nirvana's functional annotation (when provided) follows the Sequence Ontology (SO) guidelines (http://www.sequenceontology.org/). On some occasions, we had the opportunity to identify problems in the current SO and collaborate with the SO team to improve the state of the annotations.
[00243] [00243] This variant annotation tool can include pre-processing. For example, Nirvana included a large number of annotations from external data sources, such as ExAC, EVS, the 1000 Genomes Project, dbSNP, ClinVar, COSMIC, DGV and ClinGen. To make full use of these databases, we have to sanitize the information from them. We implemented different strategies to deal with the different conflicts that exist in the different data sources. For example, in the case of multiple dbSNP entries for the same position and alternative allele, we join all the IDs into a comma-separated list of IDs; if there are multiple entries with different CAF values for the same allele, we use the first CAF value. For conflicting ExAC and EVS entries, we consider the number of sample counts, and the entry with the higher sample count is used. In the 1000 Genomes Project, we removed the allele frequency of the conflicting allele. Another problem is inaccurate information. We mainly extracted the allele frequency information from the 1000 Genomes Project; however, we observed that, for GRCh38, the allele frequency reported in the info field did not exclude samples with unavailable genotypes, leading to deflated frequencies for variants that are not available for all samples. To guarantee the accuracy of our annotation, we use all the individual-level genotypes to calculate the true allele frequencies. As we know, the same variants can have different representations based on different alignments. To ensure that we can accurately report information for already identified variants, we have to pre-process the variants from the different resources so that they have a consistent representation. For all external data sources, we trim alleles to remove nucleotides duplicated between the reference allele and the alternative allele (sketched below). For ClinVar, we directly parse the xml file and perform a five-prime alignment for all variants, which is what is generally used in the vcf file. Different databases can contain the same set of information. To avoid unnecessary duplicates, we removed some duplicated information. For example, we removed variants in DGV whose data source is the 1000 Genomes Project, since we already report these variants from 1000 Genomes with more detailed information.
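The allele-trimming step can be illustrated as follows; trimming the shared suffix before the shared prefix mirrors common variant-normalization practice, though the exact order used by the tool is not specified here:

def trim_alleles(ref, alt):
    # Remove nucleotides duplicated between the reference and alternative
    # alleles so the same variant always has one consistent representation.
    # The shared suffix is trimmed first, then the shared prefix, always
    # keeping at least one base per allele. (Position adjustment for a
    # trimmed prefix is omitted in this sketch.)
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt

assert trim_alleles("CAGG", "CTGG") == ("A", "T")   # shared "C" prefix, "GG" suffix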
[00244] [00244] According to at least some implementations, the variant call application provides calls for low-frequency variants, germline calls and the like. As a non-limiting example, the variant call application can be run on tumor-only samples and / or paired tumor-normal samples. The variant call application can search for single nucleotide variations (SNV), multiple nucleotide variations (MNV), indels and the like. The variant call application identifies variants, while filtering out mismatches due to sequencing errors or sample preparation errors. For each variant, the variant caller identifies the reference sequence, a position of the variant and the potential variant sequence(s) (for example, an A to C SNV, or an AG to A deletion). The variant call application identifies the sample sequence (or sample fragment), a reference sequence / fragment and a variant call as an indication that a variant is present. The variant call application can identify raw fragments and generate a designation of the raw fragments, a count of the number of raw fragments that verify the potential variant call, the position within the raw fragment at which a support variant occurred, and other relevant information. Non-limiting examples of raw fragments include a duplex stitched fragment, a simplex stitched fragment, a duplex unstitched fragment and a simplex unstitched fragment.
[00245] [00245] The variant call application can output the calls in various formats, such as a .VCF or .GVCF file. Merely as an example, the variant call application can be included in a MiSeqReporter pipeline (for example, when implemented on the MiSeq® sequencing instrument). Optionally, the application can be implemented with several workflows. The analysis can include a single protocol or a combination of protocols that analyze the sample readings in a manner designed to obtain the desired information.
[00246] [00246] Then, the one or more processors perform a validation operation in connection with the potential variant call. The validation operation can be based on a quality score and / or a hierarchy of tiered tests, as explained below. When the validation operation authenticates or verifies the potential variant call, the validation operation passes the variant call information (from the variant call application) to the sample report generator. Alternatively, when the validation operation invalidates or disqualifies the potential variant call, the validation operation passes a corresponding indication (for example, a negative indicator, a no-call indicator, an invalid call indicator) to the sample report generator. The validation operation can also pass a confidence score related to the degree of confidence that the variant call is correct or that the invalid call designation is correct.
[00247] [00247] Then, the one or more processors generate and store a sample report. The sample report can include, for example, information on a plurality of genetic loci in relation to the sample. For example, for each genetic locus of a predetermined set of genetic loci, the sample report can do at least one of the following: provide a genotype call; indicate that a genotype call cannot be made; provide a confidence score on the certainty of the genotype call; or indicate potential problems with an assay in relation to one or more genetic loci. The sample report can also indicate the gender of the individual who provided the sample and / or indicate that the sample includes multiple sources. As used in this document, a "sample report" can include digital data (for example, a data file) for a genetic locus or a predetermined set of genetic loci and / or a printed report of the genetic locus or set of genetic loci. Thus, generating or providing can include creating a data file and / or printing the sample report, or displaying the sample report.
[00248] [00248] The sample report may indicate that a variant call was determined but not validated. When a variant call is determined to be invalid, the sample report can provide additional information on the basis for the determination not to validate the variant call. For example, the additional information in the report can include a description of the raw fragments and the extent (for example, a count) to which the raw fragments support or contradict the variant call. Additionally or alternatively, the additional information in the report can include the quality score obtained in accordance with the implementations described in this document.
[00249] [00249] The implementations disclosed in this document include the analysis of sequencing data to identify potential variant calls. The variant call can be performed on stored data from a previously performed sequencing operation. Additionally or alternatively, it can be performed in real time while a sequencing operation is being performed. Each of the sample readings is assigned to corresponding genetic loci. The sample readings can be assigned to corresponding genetic loci based on the nucleotide sequence of the sample reading or, in other words, on the order of the nucleotides within the sample reading (for example, A, C, G, T). Based on this analysis, the sample reading can be designated as including a potential variant / allele of a specific genetic locus. The sample reading can be collected (or aggregated or grouped) with other sample readings that have been designated as including potential variants / alleles of the genetic locus. The assignment operation can also be referred to as a call operation, in which the sample reading is identified as being possibly associated with a specific genetic position / locus. The sample readings can be analyzed to locate one or more identification sequences (for example, primer sequences) of nucleotides that differentiate the sample reading from other sample readings. More specifically, the identification sequence(s) can identify the sample reading, among other sample readings, as being associated with a specific genetic locus.
[00250] The assignment operation may include analyzing the series of n nucleotides of the identification sequence to determine whether the series of n nucleotides of the identification sequence effectively matches one or more of the selected sequences. In particular implementations, the assignment operation may include analyzing the first n nucleotides of the sample sequence to determine whether the first n nucleotides of the sample sequence effectively match one or more of the selected sequences. The number n can have a variety of values, which can be programmed into the protocol or entered by a user. For example, the number n can be defined as the number of nucleotides of the shortest selected sequence within the database. The number n can be a predetermined number. The predetermined number can be, for example, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 or 30 nucleotides. However, fewer or more nucleotides can be used in other implementations. The number n can also be selected by an individual, such as a user of the system. The number n can be based on one or more conditions. For example, the number n can be defined as the number of nucleotides of the shortest primer sequence within the database or as a designated number, whichever is smaller. In some implementations, a minimum value for n can be used, such as 15, so that any primer sequence that is shorter than 15 nucleotides can be designated as an exception.
[00251] [00251] In some cases, the series of n nucleotides of an identification sequence may not exactly match the nucleotides of the selected sequence. Nevertheless, the identification sequence can effectively match the selected sequence if the identification sequence is nearly identical to the selected sequence. For example, the sample reading can be called for the genetic locus if the series of n nucleotides (for example, the first n nucleotides) of the identification sequence matches a selected sequence with no more than a designated number of mismatches (for example, 3) and / or a designated number of shifts (for example, 2). Rules can be established such that each mismatch or shift counts as a difference between the sample reading and the primer sequence. If the number of differences is less than a designated number, the sample reading can be called for the corresponding genetic locus (that is, assigned to the corresponding genetic locus). In some implementations, a match score can be determined based on the number of differences between the identification sequence of the sample reading and the selected sequence associated with a genetic locus. If the match score exceeds a designated match threshold, the genetic locus corresponding to the selected sequence can be designated as a potential locus for the sample reading. In some implementations, subsequent analyses can be performed to determine whether the sample reading is called for the genetic locus. A simplified form of this matching test is sketched below.
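A simplified sketch of the effective-match test described above follows; it counts base mismatches over the first n nucleotides and, as an assumption of this sketch, ignores the shifts that the full procedure also tolerates:

def effectively_matches(identification_seq, selected_seq, n, max_mismatches=3):
    # Compare the first n nucleotides of the reading's identification
    # sequence against a selected (e.g., primer) sequence. Each mismatch
    # counts as one difference; the reading is callable to the locus only
    # if the number of differences stays within the designated limit.
    # Shifts (e.g., up to 2) are not handled in this simplified version.
    differences = sum(a != b for a, b in zip(identification_seq[:n], selected_seq[:n]))
    return differences <= max_mismatches

assert effectively_matches("ACGTACGTACGTACG", "ACGTACGAACGTACG", n=15)   # 1 mismatch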
[00252] [00252] If the sample reading effectively matches one of the selected sequences in the database (that is, matches exactly or nearly matches, as described above), the sample reading is assigned or designated to the genetic locus that correlates with the selected sequence. This can be referred to as a locus call or a provisional locus call, in which the sample reading is called for the genetic locus that correlates with the selected sequence. However, as discussed above, a sample reading can be called for more than one genetic locus. In such cases, further analysis can be performed to call or assign the sample reading to only one of the potential genetic loci. In some implementations, the sample reading that is compared to the database of reference sequences is the first reading of paired-end sequencing. When performing paired-end sequencing, a second reading (representing a raw fragment) is obtained that correlates with the sample reading. After the assignment, the subsequent analysis performed on the assigned readings can be based on the type of genetic locus that was called for the assigned reading.
[00253] [00253] Next, the sample readings are analyzed to identify potential variant calls. Among other things, the results of the analysis identify the potential variant call, a sample variant frequency, a reference sequence and the position within the genomic sequence of interest at which the variant occurred. For example, if a genetic locus is known to include SNPs, the assigned readings that were called for the genetic locus can be analyzed to identify the SNPs of the assigned readings. If the genetic locus is known to include polymorphic repetitive DNA elements, the assigned readings can be analyzed to identify or characterize the polymorphic repetitive DNA elements within the sample readings. In some implementations, if an assigned reading effectively matches both an STR locus and an SNP locus, a warning or flag can be assigned to the sample reading. The sample reading can be designated both as an STR locus and as an SNP locus. The analysis can include aligning the assigned readings according to an alignment protocol to determine sequences and / or lengths of the assigned readings. The alignment protocol can include the method described in International Patent Application No. PCT / US2013 / 030867 (Publication No. WO 2014/142831), filed on March 15, 2013, which is incorporated into this document by reference in its entirety.
[00254] [00254] Then, the one or more processors analyze the raw fragments to determine whether support variants exist at corresponding positions within the raw fragments. Various types of raw fragments can be identified. For example, the variant caller can identify a type of raw fragment that exhibits a variant that validates the original variant call. For example, the type of raw fragment can represent a duplex stitched fragment, a simplex stitched fragment, a duplex unstitched fragment or a simplex unstitched fragment. Optionally, other raw fragments can be identified instead of or in addition to the preceding examples. In connection with identifying each type of raw fragment, the variant caller also identifies the position, within the raw fragment, at which the support variant occurred, as well as a count of the number of raw fragments that exhibited the support variant. For example, the variant caller can output an indication that 10 readings of raw fragments were identified as representing duplex stitched fragments with a support variant at a particular position X. The variant caller can also output an indication that five readings of raw fragments were identified as representing simplex unstitched fragments with a support variant at a particular position Y. The variant caller can also output the number of raw fragments that matched the reference sequences and that, therefore, did not include a support variant that would otherwise provide evidence validating the potential variant call in the genomic sequence of interest.
[00255] [00255] Next, a count is maintained of the raw fragments that include support variants, as well as of the position at which the support variant occurred. Additionally or alternatively, a count can be maintained of the raw fragments that did not include support variants at the position of interest (in relation to the position of the potential variant call in the sample reading or sample fragment). Additionally or alternatively, a count can be maintained of raw fragments that match a reference sequence and do not authenticate or confirm the potential variant call. The determined information is output to the variant call validation application, including the count and type of the raw fragments that support the potential variant call, the positions of the support variant within the raw fragments, the count of the raw fragments that do not support the potential variant call, and the like.
[00256] [00256] When a potential variant call is identified, the process outputs an indication of the potential variant call, the variant sequence, the position of the variant and the reference sequence associated with it. The variant call is designated as representing a "potential" variant, since errors can cause the call process to identify a false variant. In accordance with the implementations of this document, the potential variant call is analyzed to reduce and eliminate false variants or false positives. Additionally or alternatively, the process analyzes one or more raw fragments associated with a sample reading and outputs a corresponding variant call associated with the raw fragments.
[00257] [00257] Genetic variations can help explain many diseases. Every human being has a unique genetic code, and there are many genetic variants within any group of individuals. Most deleterious genetic variants have been depleted from genomes by natural selection. It is important to identify which genetic variations are likely to be pathogenic or harmful. This will help researchers focus on likely pathogenic genetic variants and accelerate the pace of diagnosis and cure for many diseases.
[00258] [00258] Modeling the properties and functional effects (for example, pathogenicity) of variants is an important but challenging task in the field of genomics. Despite the rapid advancement of functional genomic sequencing technologies, interpreting the functional consequences of variants remains a major challenge due to the complexity of cell type-specific transcription regulation systems.
[00259] [00259] With respect to pathogenicity classifiers, deep neural networks are a type of artificial neural network that uses multiple nonlinear and complex transforming layers to successively model high-level features. Deep neural networks provide feedback via backpropagation, which carries the difference between observed and predicted outputs to adjust the parameters. Deep neural networks have evolved with the availability of large training data sets, the power of distributed and parallel computing and sophisticated training algorithms. Deep neural networks have facilitated major advances in numerous domains, such as computer vision, speech recognition and natural language processing.
[00260] [00260] Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are components of deep neural networks. Convolutional neural networks have been particularly successful in image recognition, with an architecture that comprises convolution layers, non-linear layers and pooling layers. Recurrent neural networks are designed to exploit the sequential information in input data, with cyclic connections between building blocks such as perceptrons, long short-term memory units and gated recurrent units. In addition, many other emerging deep neural networks have been proposed for limited contexts, such as deep spatio-temporal neural networks, multidimensional recurrent neural networks and convolutional auto-encoders.
[00261] [00261] The objective of training deep neural networks is optimization of the weight parameters in each layer, which gradually combines simpler features into complex features so that the most suitable hierarchical representations can be learned from the data. A single cycle of the optimization process is organized as follows. First, given a training data set, the forward pass sequentially computes the output at each layer and propagates the function signals forward through the network. In the final output layer, an objective loss function measures the error between the inferred outputs and the given labels. To minimize the training error, the backward pass uses the chain rule to backpropagate error signals and compute gradients with respect to all weights throughout the neural network. Finally, the weight parameters are updated using optimization algorithms based on stochastic gradient descent. Whereas batch gradient descent performs parameter updates over each complete data set, stochastic gradient descent provides stochastic approximations by performing updates over each small set of data examples. Several optimization algorithms build on stochastic gradient descent. For example, the Adagrad and Adam training algorithms perform stochastic gradient descent while adaptively modifying learning rates based on the update frequency and the moments of the gradients for each parameter, respectively.
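For illustration only, one such optimization cycle can be sketched as follows, assuming a PyTorch-style framework; the model, data and hyperparameters are placeholders rather than the disclosed architecture:

    import torch
    import torch.nn as nn

    # Placeholder model and mini-batch; real inputs would be one-hot encoded sequences.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
    loss_fn = nn.CrossEntropyLoss()  # objective loss between inferred outputs and labels
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adaptive SGD variant

    inputs = torch.randn(8, 16)         # one small set (mini-batch) of 8 examples
    labels = torch.randint(0, 3, (8,))  # ground-truth labels

    outputs = model(inputs)             # forward pass: propagate signals layer by layer
    loss = loss_fn(outputs, labels)     # measure error at the final output layer

    optimizer.zero_grad()
    loss.backward()                     # backward pass: chain rule backpropagates gradients
    optimizer.step()                    # update weights by stochastic gradient descent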
[00262] [00262] Another central element in the training of deep neural networks is regularization, which refers to strategies intended to avoid overfitting and thereby achieve good generalization performance. For example, weight decay adds a penalty term to the objective loss function so that the weight parameters converge to smaller absolute values. Dropout randomly removes hidden units from neural networks during training and can be regarded as an ensemble of possible subnetworks. To improve on dropout, a new activation function, maxout, and a dropout variant for recurrent neural networks called rnnDrop have been proposed. In addition, batch normalization provides a new regularization method by normalizing scalar features for each activation within a mini-batch and learning each mean and variance as parameters.
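A minimal sketch, again assuming a PyTorch-style framework, of how the three regularizers described above are commonly combined; the layer sizes are illustrative:

    import torch
    import torch.nn as nn

    # Placeholder network combining the three regularizers discussed above.
    model = nn.Sequential(
        nn.Linear(16, 32),
        nn.BatchNorm1d(32),  # batch normalization: normalize activations per mini-batch
        nn.ReLU(),
        nn.Dropout(p=0.5),   # dropout: randomly remove hidden units during training
        nn.Linear(32, 3),
    )

    # Weight decay adds an L2 penalty so weight parameters converge to smaller values.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

    out = model(torch.randn(8, 16))  # a forward pass in training mode applies dropout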
[00263] [00263] Given that sequenced data are multidimensional and high-dimensional, deep neural networks hold great promise for bioinformatics research because of their broad applicability and improved predictive power. Convolutional neural networks have been adapted to solve sequence-based genomic problems, such as motif discovery, identification of pathogenic variants and inference of gene expression. Convolutional neural networks use a weight-sharing strategy that is especially useful for studying DNA,
[00264] [00264] Therefore, a powerful computational model for predicting the pathogenicity of variants can bring enormous benefits to both basic science and translational research.
[00265] [00265] Currently, only 25 to 30% of patients with rare diseases receive a molecular diagnosis from examination of the protein-coding sequence, suggesting that the remaining diagnostic yield may reside in the 99% of the genome that is non-coding. Here, we describe a new deep learning network that accurately predicts splice junctions from an arbitrary pre-mRNA transcript sequence, allowing accurate prediction of the splice-altering effects of noncoding variants. Synonymous and intronic mutations with a predicted splice-altering consequence validate at high rates in RNA-seq and are strongly deleterious in the human population. De novo mutations with a predicted splice-altering consequence are significantly enriched in patients with autism and intellectual disability compared to healthy controls and were validated against RNA-seq data in 21 of 28 of these patients. We estimate that 9 to 11% of pathogenic mutations in patients with rare genetic diseases are caused by this previously underestimated class of disease variation.
[00266] [00266] Exome sequencing has transformed the clinical diagnosis of patients and families with rare genetic diseases and, when used as a first-line test, significantly reduces the time and costs of the diagnostic odyssey (Monroe et al., 2016; Stark et al., 2016; Tan et al., 2017). However, the diagnostic yield of exome sequencing is ~25 to 30% in rare genetic disease cohorts, leaving most patients undiagnosed even after combined exome and microarray testing (Lee et al., 2014; Trujillano et al., 2017; Yang et al., 2014). Non-coding regions play a significant role in gene regulation and account for 90% of the causal disease loci discovered in unbiased genome-wide association studies of complex human diseases (Ernst et al., 2011; Farh et al., 2015; Maurano et al., 2012), suggesting that penetrant non-coding variants may also account for a significant burden of causal mutations in rare genetic diseases. Indeed, penetrant non-coding variants that disrupt the normal mRNA splicing pattern despite lying outside the essential GT and AG splice dinucleotides, often referred to as cryptic splice variants, have long been recognized to play a significant role in rare genetic diseases (Cooper et al., 2009; Padgett, 2012; Scotti and Swanson, 2016; Wang and Cooper, 2007). However, cryptic splice mutations are often overlooked in clinical practice, owing to our incomplete understanding of the splicing code and the resulting difficulty of accurately identifying splice-altering variants outside the essential GT and AG dinucleotides (Wang and Burge, 2008).
[00267] [00267] Recently, RNA-seq has emerged as a promising assay for detecting splicing abnormalities in Mendelian disorders (Cummings et al., 2017; Kremer et al., 2017), but so far its usefulness in a clinical setting remains limited to the minority of cases in which the relevant cell type is known and accessible to biopsy. High-throughput screening assays of potential splice-altering variants (Soemedi et al., 2017) have expanded the characterization of splicing variation, but are less practical for evaluating random de novo mutations in genetic diseases, since the genomic space in which splice-altering mutations can occur is extremely large. General prediction of splicing from an arbitrary pre-mRNA sequence would potentially allow accurate prediction of the splice-altering consequences of noncoding variants, substantially improving diagnosis in patients with genetic diseases. To date, a general predictive model of splicing from raw sequence that approaches the specificity of the spliceosome has remained elusive, despite progress in specific applications, such as modeling the sequence characteristics of the core splicing motifs (Yeo and Burge, 2004), characterizing exonic splice enhancers and silencers (Fairbrother et al., 2002; Wang et al., 2004) and predicting cassette exon inclusion (Barash et al., 2010; Jha et al., 2017; Xiong et al., 2015).
[00268] [00268] The splicing of long pre-mRNAs into mature transcripts is remarkable for its precision and for the clinical severity of splice-altering mutations, but the mechanisms by which the cellular machinery determines its specificity remain incompletely understood. Here, we train a deep learning network that approaches the precision of the spliceosome in silico, identifying exon-intron boundaries from the pre-mRNA sequence with 95% accuracy and predicting functional cryptic splice mutations with a validation rate greater than 80% in RNA-seq. Noncoding variants predicted to alter splicing are strongly deleterious in the human population, with 80% of newly created cryptic splice mutations experiencing negative selection, similar to the impact of other classes of protein-truncating variation. De novo cryptic splice mutations in patients with autism and intellectual disability target the same genes that are recurrently mutated by protein-truncating mutations, enabling the discovery of additional candidate disease genes. We estimate that up to 24% of penetrant causal mutations in patients with rare genetic diseases are due to this previously underestimated class of disease variation, highlighting the need to improve interpretation of the 99% of the genome that is non-coding for clinical sequencing applications.
[00269] [00269] Clinical exome sequencing has revolutionized diagnosis for patients and families with rare genetic diseases and, when used as a first-line test, significantly reduces the time and costs of the diagnostic odyssey. However, the diagnostic yield of exome sequencing has been reported at 25 to 30% in several large cohorts of rare disease patients and their parents, leaving most patients undiagnosed even after combined exome and microarray testing. The non-coding genome is highly active in gene regulation, and non-coding variants account for about 90% of GWAS hits for common diseases, suggesting that rare variants in the non-coding genome may also be responsible for a significant fraction of causal mutations in penetrant diseases, such as rare genetic disorders and in oncology. However, the difficulty of interpreting variants in the non-coding genome means that, apart from large structural variants, the non-coding genome currently offers little additional diagnostic benefit with respect to the rare penetrant variants that have the greatest impact on clinical management.
[00270] [00270] The role of splice-altering mutations outside the canonical GT and AG splice dinucleotides has long been appreciated in rare diseases. In fact, these cryptic splice variants are the most common mutations in some rare genetic disorders, such as glycogen storage disease XI (Pompe's disease) and erythropoietic protoporphyria. The extended splice motifs at the 5' and 3' ends of introns are highly degenerate, and equally good motifs occur frequently in the genome, making accurate prediction of which noncoding variants cause cryptic splicing impractical with existing methods.
[00271] [00271] To better understand how the spliceosome achieves its specificity, we trained a deep learning neural network to predict, for each nucleotide in a pre-mRNA transcript, whether it is a splice acceptor, a splice donor or neither, using only the transcript sequence as input (FIGURE 37A). Using canonical transcripts on even chromosomes as a training set and transcripts on odd chromosomes for testing (with paralogs excluded), the deep learning network calls exon-intron boundaries with 95% accuracy, and even transcripts over 100 kb, such as CFTR, are often perfectly reconstructed with nucleotide precision (FIGURE 37B).
[00272] [00272] Next, we sought to understand the specificity determinants used by the network to recognize exon-intron boundaries with such remarkable precision. In contrast to previous classifiers that operate on statistical or hand-engineered features, deep learning learns sequence features directly and hierarchically, allowing additional specificity to be conveyed by long-range sequence context. Indeed, we found that the accuracy of the network is highly dependent on the size of the sequence context flanking the nucleotide under prediction that is provided as input to the network (Table 1), and when we train a deep learning model that uses only 40 nt of sequence, its performance only moderately exceeds existing statistical methods. This indicates that deep learning adds little to existing statistical methods for recognizing individual splicing motifs of 9 to 23 nt, but that broader sequence context is the key to distinguishing functional splice sites from non-functional sites with equally strong motifs. Asking the network to predict exons in which the sequence has been disrupted shows that disruption of the donor motif usually also causes the acceptor signal to disappear (FIGURE 37C), as is often observed with exon skipping events in vivo, indicating that a significant degree of specificity is conferred simply by requiring pairing between strong acceptor and donor motifs at an acceptable distance.
[00273] [00273] Although a large body of evidence has shown that experimental disruption of exon lengths has strong effects on exon inclusion versus exon skipping, this does not explain why the accuracy of the deep learning network continues to increase beyond 1,000 nt of context. To better differentiate between specificity driven by local splice motifs and long-range determinants of specificity, we trained a local network that takes only 100 nt of context as input. Using the local network to score known junctions, we found that exons and introns have optimal lengths (~115 nt for exons, ~1000 nt for introns) at which motif strength is weakest (FIGURE 37D). This relationship is not present in the 10,000 nt deep learning network (FIGURE 37E), indicating that intron and exon length variability is already fully accounted for by the broad-context deep learning network. Notably, intron and exon boundaries were never given to the broad-context deep learning model, indicating that it was able to derive these distances by inferring exon and intron positions from the sequence alone.
[00274] [00274] A systematic search of hexamer space also indicated that the deep learning network uses motifs in exon-intron definition, particularly the TACTAAC branch point motif at positions -34 to -14, the well-characterized GAAGAA exonic splice enhancer near the ends of exons, and poly-U motifs that normally form part of the polypyrimidine tract but also appear to act as exonic splice silencers (FIGURES 21, 22, 23, and 24).
[00275] [00275] We extended the deep learning network to the evaluation of genetic variants for splice-altering function, predicting exon-intron boundaries both in the reference transcript sequence and in the alternative transcript sequence containing the variant, and looking for changes in the exon-intron boundaries. The recent availability of aggregate exome data from 60,706 humans allows us to evaluate the effects of negative selection on variants predicted to alter splice function by examining their distribution across the allele frequency spectrum. We found that predicted cryptic splice variants are strongly under negative selection (FIGURE 38A), as evidenced by their relative depletion at high allele frequencies compared to expected counts, and their magnitude of depletion is comparable to that of essential GT or AG splice-disrupting and stop-gain variants. The impact of negative selection is greater when considering cryptic splice variants that would cause frameshifts relative to those that produce in-frame changes (FIGURE 38B). Based on the depletion of frameshift cryptic splice variants compared to other classes of protein-truncating variation, we estimate that 88% of confidently predicted cryptic splice mutations are functional.
[00276] [00276] Although aggregate whole-genome data are not as plentiful as exome data, limiting the power to detect the impact of natural selection in deep intronic regions, we were also able to calculate the observed versus expected counts of cryptic splice mutations away from exonic regions. Overall, we observed a 60% depletion of cryptic splice mutations at a distance of >50 nt from an exon-intron boundary (FIGURE 38C). The attenuated signal is probably a combination of the smaller sample size of whole-genome data compared to exome data and the greater difficulty of predicting the impact of deep intronic variants.
[00277] [00277] We can also use the observed versus expected number of cryptic splice variants to estimate the number of cryptic splice variants under selection and how this compares to other classes of protein-truncating variation. Since cryptic splice variants may only partially abolish splice function, we also evaluated the number of observed versus expected cryptic splice variants at more relaxed thresholds and estimated that there are approximately three times more deleterious rare cryptic splice variants than rare essential GT or AG splice-disrupting variants in the ExAC data set (FIGURE 38D). Each individual carries approximately 20 rare cryptic splice mutations, approximately equal to the number of protein-truncating variants (FIGURE 38E), although not all of these variants completely abolish splice function.
[00278] [00278] The recent release of GTEx data, comprising 148 individuals with whole-genome sequencing and RNA-seq from multiple tissue sites, allows us to look for the effects of rare cryptic splice variants directly in RNA-seq data. To approximate the scenario encountered in rare disease sequencing, we considered only rare variants (singletons in the GTEx cohort with allele frequency <1% in 1000 Genomes) and matched them to splicing events unique to the individual carrying the variant. Although differences in gene and tissue expression and the complexity of splicing abnormalities make it difficult to assess the sensitivity and specificity of the deep learning predictions, we found that, at strict specificity thresholds, more than 90% of rare cryptic splice mutations validate in the RNA-seq data (FIGURE 39A). A large number of aberrant splicing events present in the RNA-seq appear to be associated with variants predicted to have modest effects according to the deep learning classifier, suggesting that they only partially affect splice function. At these more sensitive thresholds, approximately 75% of the new junctions are expected to cause aberrant splicing (FIGURE 38B).
[00279] [00279] The success of the deep learning network in predicting cryptic splice variants that are strongly deleterious in population sequencing data and that validate at a high rate in RNA-seq suggests that the method could be used to identify additional diagnoses in rare disease sequencing studies. To test this hypothesis, we examined de novo variants in exome sequencing studies of autism and neurodevelopmental disorders and found that cryptic splice mutations are significantly enriched in affected individuals over healthy counterparts (FIGURE 40A). Moreover, the enrichment of cryptic splice mutations is slightly less than that of protein-truncating variants, indicating that approximately 90% of our predicted cryptic splice variants are functional. Based on these values, approximately 20% of disease-causing protein-truncating variants can be attributed to cryptic splice mutations in exons and in the nucleotides immediately adjacent to exons (FIGURE 40B). Extrapolating this figure to whole-genome studies, which are capable of interrogating the entire intronic sequence, we estimate that 24% of causal mutations in rare genetic diseases are due to cryptic splice mutations.
[00280] [00280] We estimated the probability of observing a de novo cryptic splice mutation in each individual gene, allowing us to estimate the enrichment of cryptic splice mutations in candidate disease genes compared to chance. De novo cryptic splice mutations were strongly enriched within genes previously implicated by protein-truncating variation, but not by missense variation (FIGURE 40C), indicating that they cause disease mainly through haploinsufficiency rather than other modes of action. Adding the predicted cryptic splice mutations to the list of protein-truncating variants allows us to identify 3 additional disease genes in autism and 11 additional disease genes in intellectual disability, compared to using protein-truncating variation alone (FIGURE 40D).
[00281] [00281] To assess the feasibility of validating cryptic splice mutations in patients for whom the probable disease tissue (brain, in this case) was not available, we performed deep RNA-seq on 37 subjects with predicted de novo cryptic splice mutations from the Simons Simplex Collection and looked for aberrant splicing events that were present in the individual and absent in all other individuals in the experiment and in the 149 individuals of the GTEx cohort. We found that NN of the 37 patients showed a unique, aberrant splicing event in the RNA-seq (FIGURE 40E) explained by the predicted cryptic splice variant.
[00282] [00282] In summary, we demonstrate a deep learning model that predicts cryptic splice variants with sufficient accuracy to be useful in identifying causal mutations in rare genetic diseases. We estimate that a substantial fraction of rare disease diagnoses caused by cryptic splicing are currently missed when only the protein-coding regions are considered, and we emphasize the need to develop methods for interpreting the effects of rare penetrant variation in the non-coding genome.
[00283] [00283] We constructed a deep residual neural network (He et al., 2016a) that predicts whether each position in a pre-mRNA transcript is a splice donor, a splice acceptor or neither (FIGURE 37A and FIGURES 21, 22, 23, and 24), using as input only the genomic sequence of the pre-mRNA transcript. As splice donors and acceptors can be separated by tens of thousands of nucleotides, we employed a new network architecture consisting of 32 dilated convolutional layers
[00284] [00284] We used pre-mRNA transcript sequences annotated by GENCODE (Harrow et al., 2012) on a subset of human chromosomes to train the neural network parameters, and transcripts on the remaining chromosomes, with paralogs excluded, to test the network's predictions. For pre-mRNA transcripts in the test data set, the network predicts splice junctions with 95% top-k accuracy, which is the fraction of splice sites correctly predicted at the threshold where the number of predicted sites equals the actual number of splice sites present in the test data set (Boyd et al., 2012; Yeo and Burge, 2004). Even genes longer than 100 kb, such as CFTR, are often perfectly reconstructed with nucleotide precision (FIGURE 37B). To confirm that the network does not rely solely on exonic sequence biases, we also tested the network on long non-coding RNAs. Despite the incompleteness of non-coding transcript annotations, which is expected to reduce our measured accuracy, the network predicts known splice junctions in lincRNAs with 84% top-k accuracy (FIGURES 42A and 42B), indicating that it can approximate the behavior of the spliceosome on arbitrary sequences free of protein-coding selective pressures.
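For clarity, the top-k accuracy measure defined above can be sketched as follows (illustrative Python; the array contents are toy values):

    import numpy as np

    def top_k_accuracy(scores, is_true_site):
        # Call exactly k sites, where k equals the number of true splice sites,
        # i.e. the threshold is set so that #predicted == #actual.
        k = int(is_true_site.sum())
        top_k_idx = np.argsort(scores)[::-1][:k]  # k highest-scoring positions
        return is_true_site[top_k_idx].mean()     # fraction of called sites that are true

    # Toy example: 10 positions, 3 true splice sites.
    scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.05, 0.3, 0.6, 0.15, 0.4])
    truth = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])
    print(top_k_accuracy(scores, truth))  # 2 of the 3 top-scoring positions are true sites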
[00285] [00285] For each GENCODE-annotated exon in the test data set (excluding the first and last exons of each gene), we also examined whether the network's prediction scores correlate with the fraction of reads supporting exon inclusion versus exon skipping, based on RNA-seq data from the Genotype-Tissue Expression (GTEx) atlas (The GTEx Consortium et al., 2015) (FIGURE 37C). Exons that were constitutively spliced in or spliced out across GTEx tissues had prediction scores close to 1 or 0, respectively, while exons subject to a substantial degree of alternative splicing (between 10 and 90% exon inclusion averaged across samples) tended to have intermediate scores (Pearson correlation = 0.78, P ≈ 0).
[00286] [00286] Next, we sought to understand the sequence determinants used by the network to achieve its remarkable accuracy. We performed systematic in silico substitutions of each nucleotide near annotated exons, measuring the effects on the network's prediction scores at the adjacent splice sites (FIGURE 37E). We found that disrupting the sequence of a splice donor motif often caused the network to predict that the upstream splice acceptor site would also be lost, as is seen with exon skipping events in vivo, indicating that a significant degree of specificity is conferred by exon definition between a paired upstream acceptor motif and a downstream donor motif spaced at an ideal distance (Berget, 1995). Additional motifs that contribute to the splicing signal include the well-characterized binding motifs of the SR protein family and the branch point (FIGURES 43A and 43B) (Fairbrother et al., 2002; Reed and Maniatis, 1988). The effects of these motifs are highly dependent on their position in the exon, suggesting that their roles include specifying the precise positioning of intron-exon boundaries by differentiating between competing acceptor and donor sites.
[00287] [00287] Training the network with different input sequence contexts significantly affects the accuracy of splice predictions (FIGURE 37E), indicating that long-range sequence determinants up to 10,000 nt from the splice site are essential for discerning functional splice junctions from the large number of non-functional sites with near-optimal motifs. To examine long-range and short-range determinants of specificity, we compared the scores assigned to annotated junctions by a model trained on 80 nt of sequence context (SpliceNet-80nt) versus the complete model trained on 10,000 nt of context (SpliceNet-10k). The network trained on 80 nt of sequence context assigns lower scores to junctions that border exons or introns of average length (~150 nt for exons, ~1000 nt for introns) (FIGURE 37F), consistent with previous observations that such sites tend to have weaker splice motifs than the splice sites of exons and introns that are exceptionally long or short (Amit et al., 2012; Gelfman et al., 2012; Li et al., 2015). In contrast, the network trained on 10,000 nt of sequence context shows a preference for introns and exons of average length, despite their weaker splice motifs, because it can account for the long-range specificity conferred by exon or intron length. The skipping of weaker motifs at long, uninterrupted introns is consistent with the faster elongation of RNA polymerase II experimentally observed in the absence of exon pausing, which may allow the spliceosome less time to recognize suboptimal motifs (Close et al., 2012; Jonkers et al., 2014; Veloso et al., 2014). Our findings suggest that the average splice junction possesses favorable long-range sequence determinants that confer substantial specificity, explaining the high degree of sequence degeneracy tolerated in most splice motifs.
[00288] [00288] As splicing occurs co-transcriptionally (Cramer et al., 1997; Tilgner et al., 2012), interactions between chromatin state and co-transcriptional splicing may also guide exon definition (Luco et al., 2011) and could potentially be used by the network to the extent that chromatin state is predictable from the primary sequence. In particular, genome-wide studies of nucleosome positioning have shown that nucleosome occupancy is higher in exons (Andersson et al., 2009; Schwartz et al., 2009; Spies et al., 2009; Tilgner et al., 2009). To test whether the network uses sequence determinants of nucleosome positioning to predict splice sites, we scanned a pair of optimal acceptor and donor motifs separated by 150 nt (approximately the size of an average exon) across the genome and asked the network to predict whether the motif pair would result in exon inclusion at that locus (FIGURE 37G). We found that positions predicted to favor exon inclusion correlate with positions of high nucleosome occupancy, even in intergenic regions (Spearman correlation = 0.36, P ≈ 0), and this effect persists after controlling for GC content (FIGURE 44A). These results suggest that the network has implicitly learned to predict nucleosome positioning from the primary sequence and uses it as a specificity determinant in exon definition. Similar to exons and introns of average length, exons positioned on nucleosomes have weaker local splice motifs (FIGURE 44B), consistent with greater tolerance of degenerate motifs in the presence of compensatory factors (Spies et al., 2009).
[00289] [00289] Although several studies have reported a correlation between exons and nucleosome occupancy, a causal role for nucleosome positioning in exon definition has not been firmly established. Using data from 149 individuals with both RNA-seq and whole-genome sequencing from the Genotype-Tissue Expression (GTEx) cohort (The GTEx Consortium et al., 2015), we identified novel exons that were particular to a single individual and corresponded to a genetic mutation creating a private splice site. These private exon creation events were significantly associated with pre-existing nucleosome positioning in K562 and GM12878 cells
[00290] [00290] We extended the deep learning network to the evaluation of genetic variants for splice-altering function, predicting exon-intron boundaries for both the reference pre-mRNA transcript sequence and the alternative transcript sequence containing the variant, and taking the difference between the scores (Δ score) (FIGURE 38A). It is important to note that the network was trained only on reference transcript sequences and splice junction annotations and never saw variant data during training, making the prediction of variant effects a challenging test of the network's ability to accurately model the sequence determinants of splicing.
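As a simplified sketch of this differencing step (the per-position probabilities and the function name delta_score are illustrative, not the disclosed scoring procedure):

    import numpy as np

    def delta_score(ref_scores, alt_scores):
        # Largest change in predicted splice-site probability caused by the variant,
        # comparing the network's outputs on the reference and alternative sequences.
        return np.max(np.abs(alt_scores - ref_scores))

    # Toy example: the variant creates a strong cryptic donor at position 3
    # while weakening the annotated donor at position 4.
    ref = np.array([0.01, 0.02, 0.01, 0.05, 0.90])
    alt = np.array([0.01, 0.02, 0.01, 0.85, 0.30])
    print(delta_score(ref, alt))  # 0.80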
[00291] [00291] We looked for the effects of cryptic splice variants in RNA-seq data from the GTEx cohort (The GTEx Consortium et al., 2015), comprising 149 individuals with whole-genome sequencing and RNA-seq from multiple tissues. To approximate the scenario encountered in rare disease sequencing, we first focused on rare, private mutations (present in only one individual in the GTEx cohort). We found that private mutations predicted by the neural network to have functional consequences are strongly enriched at novel private splice junctions and at the boundaries of skipped exons in private exon skipping events (FIGURE 38B), suggesting that a large fraction of these predictions are functional.
[00292] [00292] To quantify the effects of splice site-creating variants on the relative production of normal and aberrant splice isoforms, we measured the number of reads supporting the novel splice event as a fraction of the total number of reads covering the site (FIGURE 38C) (Cummings et al., 2017). For splice site-disrupting variants, we observed that many exons had a low baseline rate of exon skipping, and the effect of the variant was to increase the fraction of exon skipping reads. Therefore, we calculated both the decrease in the fraction of reads spliced at the disrupted junction and the increase in the fraction of reads skipping the exon, taking the greater of the two effects (FIGURE 45 and STAR Methods).
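A short sketch of the two read-fraction measures described above (function names and counts are illustrative):

    def aberrant_splice_fraction(novel_junction_reads, total_site_reads):
        # Reads supporting the novel splice event as a fraction of all reads over the site.
        return novel_junction_reads / total_site_reads

    def disruption_effect(junction_frac_ref, junction_frac_alt,
                          skip_frac_ref, skip_frac_alt):
        # The greater of: the decrease in reads spliced at the disrupted junction,
        # and the increase in reads skipping the exon.
        return max(junction_frac_ref - junction_frac_alt,
                   skip_frac_alt - skip_frac_ref)

    print(aberrant_splice_fraction(15, 100))          # 0.15
    print(disruption_effect(0.95, 0.55, 0.05, 0.35))  # 0.40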
[00293] [00293] Confidently predicted cryptic splice variants (Δ score > 0.5) validate in the RNA-seq data at three-quarters the rate of essential GT or AG splice disruptions (FIGURE 38D). Both the validation rate and the effect size of cryptic splice variants closely track their Δ scores (FIGURES 38D and 38E), demonstrating that the model's prediction score is a good proxy for the splice-altering potential of a variant. Validated variants, especially those with lower scores (Δ score < 0.5), are often incompletely penetrant and result in alternative splicing, with production of a mixture of aberrant and normal transcripts in the RNA-seq data (FIGURE 38E). Our estimates of validation rates and effect sizes are conservative and probably underestimate the true values, owing to unaccounted-for isoform changes and nonsense-mediated decay, which preferentially degrades aberrantly spliced transcripts because they often introduce premature stop codons (FIGURE 38C and FIGURE 45). This is evidenced by the average effect sizes of variants disrupting the essential GT and AG splice dinucleotides being less than the 50% expected for fully penetrant heterozygous variants.
[00294] [00294] For cryptic splice variants that produce aberrant splice isoforms in at least three-tenths of the observed copies of the mRNA transcript, the network has a sensitivity of 71% when the variant is near exons and 41% when the variant is in deep intronic sequence (Δ score > 0.5, FIGURE 38F). These findings indicate that deep intronic variants are harder to predict, possibly because deep intronic regions contain fewer of the specificity determinants that are selected for near exons.
[00295] [00295] To compare the performance of our network with existing methods, we selected three popular classifiers that have been referenced in the literature for rare genetic disease diagnosis, GeneSplicer (Pertea et al., 2001), MaxEntScan (Yeo and Burge, 2004) and NNSplice (Reese et al., 1997), and plotted the RNA-seq validation rate and the sensitivity at varying thresholds (FIGURE 38G). As has been the experience of others in the field (Cummings et al., 2017), we found that the existing classifiers have insufficient specificity, given the very large number of non-coding variants across the genome that could affect splicing, presumably because they focus on local motifs and largely do not account for long-range determinants of specificity.
[00296] [00296] Given the large performance gap compared to existing methods, we performed additional controls to exclude the possibility that our results on the RNA-seq data could be confounded by overfitting. First, we repeated the validation and sensitivity analyses separately for private variants and for variants present in more than one individual in the GTEx cohort (FIGURES 46A, 46B and 46C). Since neither the splicing machinery nor the deep learning model has access to allele frequency information, verifying that the network performs similarly across the allele frequency spectrum is an important control. We found that, at the same Δ score thresholds, private and common cryptic splice variants show no significant differences in their RNA-seq validation rate (P > 0.05, Fisher's exact test), indicating that the network's predictions are robust to allele frequency.
[00297] [00297] Second, to validate the model's predictions across the different types of cryptic splice variants that can create new splice junctions, we separately evaluated variants that generate new GT or AG dinucleotides, variants that affect the extended acceptor or donor motif, and variants that occur in more distal regions. We found that cryptic splice variants are distributed approximately equally among the three groups and that, at the same Δ score thresholds, there are no significant differences in validation rate or effect sizes between the groups (P > 0.37, test of uniformity, and P > 0.3, Mann-Whitney U test, respectively; FIGURES 47A and 47B).
[00298] [00298] Third, we performed the RNA-seq validation and sensitivity analyses separately for variants on the chromosomes used for training and for variants on the remaining chromosomes (FIGURES 48A and 48B). Although the network was trained only on genomic reference sequences and splice annotations and was not exposed to variant data during training, we wanted to rule out the possibility of bias in the variant predictions arising from the network having seen the reference sequence of the training chromosomes. We found that the network performs equally well on variants from the training and test chromosomes, with no significant difference in validation rate or sensitivity (P > 0.05, Fisher's exact test), indicating that the network's variant predictions are not explained by overfitting to the training sequences.
[00299] [00299] Predicting cryptic splice variants is a harder problem than predicting annotated splice junctions, as reflected in the results of our model and of other splice prediction algorithms (compare FIGURE 37E and FIGURE 38G). An important reason is the difference in the underlying distribution of exon inclusion rates between the two types of analysis. The vast majority of GENCODE-annotated exons have strong specificity determinants, resulting in constitutive splicing and prediction scores close to 1 (FIGURE 37C). In contrast, most cryptic splice variants are only partially penetrant (FIGURES 38D and 38E), have low to intermediate prediction scores and often lead to alternative splicing, with production of a mixture of normal and aberrant transcripts. This makes the latter problem of predicting the effects of cryptic splice variants intrinsically harder than identifying annotated splice sites. Additional factors, such as nonsense-mediated decay, unaccounted-for isoform changes and limitations of the RNA-seq assay, further contribute to lowering the RNA-seq validation rate (FIGURE 38C and FIGURE 45).
[00300] [00300] Alternative splicing is one of the main modes of gene regulation, serving to increase transcript diversity across tissues and developmental stages, and its deregulation is associated with disease processes (Blencowe, 2006; Irimia et al., 2014; Keren et al., 2010; Licatalosi and Darnell, 2006; Wang et al., 2008). Unexpectedly, we found that the relative usage of new splice junctions created by cryptic splice mutations can vary substantially between tissues (FIGURE 39A). Moreover, variants causing tissue-specific differences in splicing are reproducible across multiple individuals (FIGURE 39B), indicating that tissue-specific biology, rather than stochastic effects, is likely to underlie these differences. We found that 35% of cryptic splice variants with weak or intermediate predicted scores (Δ score 0.35 to 0.8) exhibit significant differences in the fraction of normal and aberrant transcripts produced across tissues (Bonferroni-corrected P < 0.01; FIGURE 39C). This contrasts with variants with high predicted scores (Δ score > 0.8), which were significantly less likely to produce tissue-specific effects (P = 0.015). Our findings are in line with the earlier observation that alternatively spliced exons tend to have intermediate prediction scores (FIGURE 37C), compared to exons that are constitutively spliced in or spliced out, whose scores are close to 1 or 0, respectively.
[00301] [00301] These results support a model in which tissue-specific factors, such as chromatin context and the binding of RNA-binding proteins, can shift the competition between two splice junctions that are close in favorability (Gelfman et al., 2013; Luco et al., 2010; Shukla et al., 2011; Ule et al., 2003). Strong cryptic splice variants probably shift splicing from the normal to the aberrant isoform entirely, regardless of the epigenetic context, whereas weaker variants bring the splice junction selection closer to the decision boundary, resulting in alternative junction usage in different tissue types and cellular contexts. This highlights the unexpected role played by cryptic splice mutations in generating new alternative splicing diversity, as natural selection would have the opportunity to preserve mutations that create useful tissue-specific alternative splicing.
[00302] [00302] Although the predicted cryptic splice variants validate at a high rate in RNA-seq, in many cases the effects are not fully penetrant and a mixture of normal and aberrant splice isoforms is produced, raising the possibility that a fraction of these predicted splice-altering variants may not be functionally significant. To explore the signature of natural selection on the predicted cryptic splice variants, we scored each variant present in 60,706 human exomes from the Exome Aggregation Consortium (ExAC) database (Lek et al., 2016) and identified variants predicted to alter the exon-intron boundaries.
[00303] [00303] To measure the extent of negative selection acting on predicted splice-altering variants, we counted the number of predicted splice-altering variants found at common allele frequencies (≥0.1% in the human population) and compared them with the number of predicted splice-altering variants at singleton allele frequencies in ExAC (i.e., in 1 of 60,706 individuals). Owing to the recent exponential expansion of the human population, singleton variants represent recently created mutations that have been minimally filtered by purifying selection (Tennessen et al., 2012). In contrast, common variants represent a subset of neutral mutations that have passed through the sieve of purifying selection. Therefore, the depletion of predicted splice-altering variants in the common allele frequency spectrum relative to singleton variants provides an estimate of the fraction of predicted splice-altering variants that are deleterious and therefore functional. To avoid confounding effects on the protein-coding sequence, we restricted our analysis to synonymous variants and intronic variants outside the essential GT or AG dinucleotides, excluding missense mutations that are also predicted to have splice-altering effects.
[00304] [00304] At common allele frequencies, confidently predicted cryptic splice variants (Δ score > 0.8) are under strong negative selection, as evidenced by their relative depletion compared to expectation (FIGURE 40A). At this threshold, where most variants are expected to be almost fully penetrant in the RNA-seq data (FIGURE 38D), predicted synonymous and intronic cryptic splice mutations are depleted by 78% at common allele frequencies, comparable to the 82% depletion of frameshift, stop-gain and essential GT or AG splice-disrupting variants (FIGURE 40B). The impact of negative selection is greater when considering cryptic splice variants that would cause frameshifts relative to those producing in-frame changes (FIGURE 40B). The depletion of cryptic splice variants that result in frameshifts is almost identical to that of other classes of protein-truncating variation, indicating that the vast majority of cryptic splice mutations confidently predicted in the near-intronic region (<50 nt from known exon-intron boundaries) are functional and have highly deleterious effects in the human population.
[00305] [00305] To extend this analysis to deep intronic regions >50 nt from known exon-intron boundaries, we used aggregate whole-genome sequencing data from 15,496 humans in the Genome Aggregation Database (gnomAD) cohort (Lek et al., 2016) to calculate the observed and expected counts of cryptic splice mutations at common allele frequencies. Overall, we observed a 56% depletion of common cryptic splice mutations (Δ score > 0.8) at a distance >50 nt from an exon-intron boundary (FIGURE 40D), consistent with the greater difficulty of predicting the impact of deep intronic variants, as noted in the RNA-seq data.
[00306] [00306] Next, we sought to estimate the potential of cryptic splice mutations to contribute to penetrant genetic disease, relative to other types of protein-coding variation, by measuring the number of rare cryptic splice mutations per individual in the gnomAD cohort. Based on the fraction of predicted cryptic splice mutations that are under negative selection (FIGURE 40A), the average human carries ~5 rare functional cryptic splice mutations (allele frequency <0.1%), compared to ~11 protein-truncating variants (FIGURE 40E). Cryptic splice variants outnumber essential GT or AG splice-disrupting variants by approximately 2:1. We caution that a significant fraction of these cryptic splice variants may not completely abolish gene function, either because they produce in-frame changes or because they do not completely shift splicing to the aberrant isoform.
[00307] [00307] Large-scale sequencing studies of patients with autism spectrum disorders and severe intellectual disability have demonstrated the central role of de novo protein-coding mutations (missense, nonsense, frameshift and essential splice dinucleotide) that disrupt genes in neurodevelopmental pathways (Fitzgerald et al., 2015; Iossifov et al., 2014; McRae et al., 2017; Neale et al., 2012; De Rubeis et al., 2014; Sanders et al., 2012). To assess the clinical impact of non-coding mutations acting through altered splicing, we applied the neural network to predict the effects of de novo mutations in 4,293 individuals with intellectual disability from the Deciphering Developmental Disorders (DDD) cohort (McRae et al., 2017), 3,953 individuals with autism spectrum disorders (ASD) from the Simons Simplex Collection (De Rubeis et al., 2014; Sanders et al., 2012; Turner et al., 2016) and the Autism Sequencing Consortium, and 2,073 unaffected sibling controls from the Simons Simplex Collection. To control for differences in de novo variant ascertainment between studies, we normalized the expected number of de novo variants so that the number of synonymous mutations per individual was the same across cohorts.
[00308] [00308] De novo mutations that are predicted to disrupt splicing are enriched 1.51-fold in intellectual disability (P =
[00309] [00309] To estimate the enrichment of cryptic splice mutations in candidate disease genes compared to chance, we calculated the probability of observing a de novo cryptic splice mutation in each individual gene, using the trinucleotide context to adjust for mutation rate (Samocha et al., 2014) (Table S4). Combining cryptic splice mutations with protein-coding mutations in novel gene discovery yields 5 additional candidate genes associated with intellectual disability and 2 additional genes associated with autism spectrum disorder (FIGURE 41D and FIGURE 45) that would fall below the discovery threshold (FDR < 0.01) if only protein-coding mutations were considered (Kosmicki et al., 2017; Sanders et al., 2015).
[00310] [00310] We obtained lymphoblastoid cell lines (LCLs) derived from peripheral blood of 36 individuals from the Simons Simplex Collection who harbored predicted de novo cryptic splice mutations in genes with at least a minimal level of LCL expression (De Rubeis et al., 2014; Sanders et al., 2012); each individual represented the only case of autism within his or her immediate family. As is the case for most rare genetic diseases, the tissue and cell type of relevance (presumably the developing brain) was not accessible. Therefore, we performed high-depth mRNA sequencing (~350 million x 150 bp unique reads per sample, approximately 10 times the GTEx coverage) to compensate for the low expression of many of these transcripts in LCLs. To ensure validation of a representative set of predicted cryptic splice variants, rather than only the top predictions, we applied relatively permissive thresholds (Δ score > 0.1 for splice-loss variants and Δ score > 0.5 for splice-gain variants; STAR Methods), and experimental validation was performed on all de novo variants meeting these criteria.
[00311] [00311] After excluding 8 individuals who had insufficient RNA-seq coverage in the gene of interest, we identified unique aberrant splicing events associated with the predicted de novo cryptic splice mutation in 21 of the 28 patients (FIGURE 41E and FIGURES 51A, 51B, 51C, 51D, 51E, 51F, 51G, 51H, 51I and 51J). These aberrant splicing events were absent in the other 35 individuals for whom deep LCL RNA-seq was obtained, as well as in the 149 individuals of the GTEx cohort. Among the 21 confirmed de novo cryptic splice mutations, we observed 9 cases of novel junction creation, 8 cases of exon skipping and 4 cases of intron retention, as well as more complex splicing aberrations.
[00312] [00312] The high validation rate of predicted cryptic splice mutations in patients with autism spectrum disorder (75%), despite the limitations of the RNA-seq assay, indicates that most of the predictions are functional. However, the enrichment of de novo cryptic splice variants in cases compared to controls (1.5 times in DDD and 1.3 times in ASD, FIGURE 41A) represents only 38% of the effect size observed for de novo protein-truncating variants (2.5 times in DDD and 1.7 times in ASD) (Iossifov et al., 2014; McRae et al., 2017; De Rubeis et al., 2014). This allows us to quantify that functional cryptic splice mutations have about 50% of the clinical penetrance of the classic forms of protein-truncating mutation (stop-gain, frameshift and essential splice dinucleotide), because many of them only partially disrupt production of the normal transcript. In fact, some of the best-characterized cryptic splice mutations in Mendelian diseases, such as c.315-48T>C in FECH (Gouya et al., 2002) and c.-32-13T>G in GAA (Boerkoel et al., 1995), are hypomorphic alleles associated with a milder phenotype or later age of onset. The clinical penetrance estimate is calculated over all de novo variants meeting a relatively permissive threshold (Δ score > 0.1), and variants with stronger prediction scores would be expected to have correspondingly higher penetrance.
[00313] [00313] Based on the excess of de novo mutations in cases versus controls in the ASD and DDD cohorts, 250 cases can be explained by de novo cryptic splice mutations, compared to
[00314] [00314] We describe systems, methods and articles of manufacture for using an atrous convolutional neural network trained to detect splice sites in a genomic sequence (for example, a nucleotide sequence or an amino acid sequence). One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. The omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated by reference into each of the following implementations.
[00315] [00315] This disclosure uses the terms module(s) and stage(s) interchangeably.
[00316] [00316] A system implementation of the disclosed technology includes one or more processors coupled to the memory. The memory is loaded with computer instructions to train a splice site detector that identifies splice sites in genomic sequences (for example, nucleotide sequences).
[00317] [00317] As shown in FIGURE 30, the system trains an atrous convolutional neural network (abbreviated ACNN) on at least 50,000 donor splice site training examples, at least 50,000 acceptor splice site training examples and at least 100,000 non-splicing site training examples. Each training example is a target nucleotide sequence that has at least one target nucleotide flanked by at least 20 nucleotides on each side.
[00318] [00318] An ACNN is a convolutional neural network that uses atrous/dilated convolutions, which allow large receptive fields with few trainable parameters. An atrous/dilated convolution is a convolution in which the kernel is applied over an area greater than its length by skipping input values with a certain step, also called the atrous convolution rate or dilation factor. Atrous/dilated convolutions add spacing between the elements of a convolution filter/kernel, so that neighboring inputs (for example, nucleotides or amino acids) at longer intervals are considered when a convolution operation is performed. This enables the incorporation of long-range contextual dependencies into the input. Atrous convolutions retain partial convolution calculations for reuse as adjacent nucleotides are processed.
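A brief sketch, assuming a PyTorch-style Conv1d, of how dilation enlarges the receptive field without adding trainable parameters:

    import torch
    import torch.nn as nn

    # 1-D convolutions over a one-hot encoded sequence (4 channels: A, C, G, T).
    # With kernel size W and dilation D, each output sees a span of (W - 1) * D + 1 inputs.
    dense = nn.Conv1d(4, 32, kernel_size=11, dilation=1)   # span 11
    atrous = nn.Conv1d(4, 32, kernel_size=11, dilation=4)  # span 41, same parameter count

    x = torch.randn(1, 4, 1000)  # one sequence of length 1000
    print(dense(x).shape, atrous(x).shape)  # [1, 32, 990] and [1, 32, 960] (no padding)
    print(sum(p.numel() for p in dense.parameters()),
          sum(p.numel() for p in atrous.parameters()))  # both 1440: 4*32*11 weights + 32 biases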
[00319] [00319] As shown in FIGURE 30, to evaluate a training example using the ACNN, the system provides, as input to the ACNN, a target nucleotide sequence flanked by at least 40 upstream context nucleotides and at least 40 downstream context nucleotides.
[00320] [00320] As shown in FIGURE 30, based on the evaluation, the ACNN then produces, as output, triple scores for the probability that each nucleotide in the target nucleotide sequence is a splice donor site, a splice acceptor site or a non-splicing site.
[00321] [00321] This system implementation and other disclosed systems optionally include one or more of the following features. The system may also include features described in connection with the disclosed methods. For brevity, alternative combinations of system features are not listed individually. Features applicable to systems, methods and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
[00322] [00322] As shown in FIGURES 25, 26 and 27, the input can comprise a target nucleotide sequence that has a target nucleotide flanked by 2500 nucleotides on each side. In such an implementation, the target nucleotide sequence is further flanked by 5000 upstream context nucleotides and 5000 downstream context nucleotides.

[00323] [00323] The input can comprise a target nucleotide sequence that has a target nucleotide flanked by 100 nucleotides on each side. In such an implementation, the target nucleotide sequence is further flanked by 200 upstream context nucleotides and 200 downstream context nucleotides.

[00324] [00324] The input can comprise a target nucleotide sequence that has a target nucleotide flanked by 500 nucleotides on each side. In such an implementation, the target nucleotide sequence is further flanked by 1000 upstream context nucleotides and 1000 downstream context nucleotides.
[00325] [00325] As shown in FIGURE 28, the system can train the ACNN on 150,000 donor splice site training examples, 150,000 acceptor splice site training examples and 800,000,000 non-splicing site training examples.
[00326] [00326] As shown in FIGURE 19, the ACNN can comprise groups of residual blocks arranged in a sequence from lowest to highest. Each group of residual blocks is parameterized by a number of convolution filters in the residual blocks, a convolution window size of the residual blocks and an atrous convolution rate of the residual blocks.
[00327] [00327] As shown in FIGURES 21, 22, 23 and 24, in the ACNN, the atrous convolution rate progresses non-exponentially from a lower group of residual blocks to a higher group of residual blocks.
[00328] [00328] As shown in FIGURES 21, 22, 23 and 24, in the ACNN, the convolution window size varies between groups of residual blocks.
[00329] [00329] The ACNN can be configured to evaluate an input comprising a target nucleotide sequence flanked by 40 upstream context nucleotides and 40 downstream context nucleotides. In such an implementation, the ACNN includes one group of four residual blocks and at least one skip connection. Each residual block has 32 convolution filters, a convolution window size of 11 and an atrous convolution rate of 1. This implementation of the ACNN is referred to in this document as "SpliceNet80" and is shown in FIGURE 21.
[00330] [00330] The ACNN can be configured to evaluate an input comprising a target nucleotide sequence flanked by 200 upstream context nucleotides and 200 downstream context nucleotides. In such an implementation, the ACNN includes at least two groups of four residual blocks and at least two skip connections. Each residual block in a first group has 32 convolution filters, a convolution window size of 11 and an atrous convolution rate of 1. Each residual block in a second group has 32 convolution filters, a convolution window size of 11 and an atrous convolution rate of 4. This implementation of the ACNN is referred to in this document as "SpliceNet400" and is shown in FIGURE 22.
[00331] [00331] The ACNN can be configured to evaluate an input comprising a target nucleotide sequence flanked by 1000 upstream context nucleotides and 1000 downstream context nucleotides. In such an implementation, the ACNN includes at least three groups of four residual blocks and at least three skip connections. Each residual block in a first group has 32 convolution filters, a convolution window size of 11 and an atrous convolution rate of 1. Each residual block in a second group has 32 convolution filters, a convolution window size of 11 and an atrous convolution rate of 4. Each residual block in a third group has 32 convolution filters, a convolution window size of 21 and an atrous convolution rate of 19. This implementation of the ACNN is referred to in this document as "SpliceNet2000" and is shown in FIGURE 23.
[00332] [00332] The ACNN can be configured to evaluate an input comprising a target nucleotide sequence flanked by 5000 upstream context nucleotides and 5000 downstream context nucleotides. In such an implementation, the ACNN includes at least four groups of four residual blocks and at least four skip connections. Each residual block in a first group has 32 convolution filters, a convolution window size of 11 and an atrous convolution rate of 1. Each residual block in a second group has 32 convolution filters, a convolution window size of 11 and an atrous convolution rate of 4. Each residual block in a third group has 32 convolution filters, a convolution window size of 21 and an atrous convolution rate of 19. Each residual block in a fourth group has 32 convolution filters, a convolution window size of 41 and an atrous convolution rate of 25. This implementation of the ACNN is referred to in this document as "SpliceNet10000" and is shown in FIGURE 24.
[00333] [00333] The triple scores for each nucleotide in the target nucleotide sequence can be exponentially normalized to sum to unity. In such an implementation, the system classifies each nucleotide in the target nucleotide sequence as the splice donor site, the splice acceptor site or the non-splicing site based on the highest of the respective triple scores.
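The following is a minimal Python sketch of the exponential normalization and classification step described above, assuming raw triple scores arranged as an (L, 3) array; the donor/acceptor/non-splicing column order is illustrative:

    import numpy as np

    LABELS = ["donor", "acceptor", "non-splicing"]

    def classify(raw_scores):
        # Exponential normalization (softmax) so each row sums to unity.
        e = np.exp(raw_scores - raw_scores.max(axis=1, keepdims=True))
        probs = e / e.sum(axis=1, keepdims=True)
        # Classify each nucleotide by the highest of its three normalized scores.
        return probs, [LABELS[i] for i in probs.argmax(axis=1)]

    probs, calls = classify(np.array([[4.0, 0.1, 0.2], [0.0, 0.1, 5.0]]))
    print(calls)  # ['donor', 'non-splicing']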
[00334] [00334] As shown in FIGURE 35, the dimensionality of the ACNN input can be defined as (Cu + L + Cd) x 4, where Cu is the number of upstream context nucleotides, Cd is the number of downstream context nucleotides and L is the number of nucleotides in the target nucleotide sequence. In one implementation, the dimensionality of the input is (5000 + 5000 + 5000) x 4.
[00335] [00335] As shown in FIGURE 35, the dimensionality of the ACNN output can be defined as L x 3. In an implementation, the dimensionality of the output is 5000 x 3.
[00336] [00336] As shown in FIGURE 35, each group of residual blocks can produce an intermediate output by processing a previous input. The dimensionality of the intermediate output can be defined as (I - [(W - 1) * D * A]) x N, where I is the dimensionality of the previous input, W is the convolution window size of the residual blocks, D is the atrous convolution rate of the residual blocks, A is the number of atrous convolution layers in the group and N is the number of convolution filters in the residual blocks.
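The following is a minimal Python sketch of the intermediate-output dimensionality formula above; the example parameters (a group of four residual blocks with W = 11, D = 1, N = 32 and two atrous convolution layers per block, hence A = 8) mirror the first residual block group described earlier, and the input length is illustrative:

    def group_output_dims(I, W, D, A, N):
        # I: dimensionality of the previous input, W: convolution window size,
        # D: atrous convolution rate, A: atrous convolution layers in the group,
        # N: number of convolution filters.
        return (I - (W - 1) * D * A, N)

    # 4 residual blocks x 2 atrous convolution layers each => A = 8
    print(group_output_dims(I=5080, W=11, D=1, A=8, N=32))  # (5000, 32)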
[00337] [00337] As shown in FIGURE 32, the ACNN evaluates the training examples batch-wise during an epoch. The training examples are randomly sampled into batches. Each batch has a predetermined size. The ACNN repeats the evaluation of the training examples over multiple epochs (for example, 1-10).
[00338] [00338] The input can comprise a target nucleotide sequence that has two adjacent target nucleotides. The two adjacent target nucleotides can be adenine (abbreviated A) and guanine (abbreviated G). The two adjacent target nucleotides can be guanine (abbreviated G) and uracil (abbreviated U).
[00339] [00339] The system includes a one-hot encoder (shown in FIGURE 29) that sparsely encodes the training examples and provides one-hot encodings as input.
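A minimal Python sketch of such sparse one-hot encoding is shown below; the A/C/G/T column order is an assumption for illustration:

    import numpy as np

    INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

    def one_hot(seq):
        # Encode a nucleotide sequence as an (L, 4) array with a single 1 per row.
        out = np.zeros((len(seq), 4), dtype=np.float32)
        for i, base in enumerate(seq.upper()):
            if base in INDEX:          # unknown bases (e.g., N) stay all-zero
                out[i, INDEX[base]] = 1.0
        return out

    print(one_hot("ACGT"))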
[00340] [00340] ACNN can be parameterized by a number of residual blocks, a number of skip connections and a number of residual connections.
[00341] [00341] The ACNN can comprise dimensionality-altering convolution layers that reshape the spatial and feature dimensions of a previous input.
[00342] [00342] As shown in FIGURE 20, each residual block may comprise at least one batch normalization layer, at least one rectified linear unit layer (abbreviated ReLU), at least one atrous convolution layer and at least one residual connection. In such an implementation, each residual block comprises two layers of batch normalization, two layers of ReLU non-linearity, two layers of atrous convolution and a residual connection.
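The following is a minimal PyTorch sketch of such a residual block, with two repetitions of (batch normalization, ReLU, atrous convolution) and a residual connection; the channel count and 'same' padding are illustrative assumptions rather than the disclosed configuration:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels=32, window=11, dilation=1):
            super().__init__()
            pad = (window - 1) * dilation // 2  # keep length so the skip-add aligns
            self.body = nn.Sequential(
                nn.BatchNorm1d(channels), nn.ReLU(),
                nn.Conv1d(channels, channels, window, dilation=dilation, padding=pad),
                nn.BatchNorm1d(channels), nn.ReLU(),
                nn.Conv1d(channels, channels, window, dilation=dilation, padding=pad),
            )

        def forward(self, x):          # x: (batch, channels, length)
            return x + self.body(x)    # residual connection

    x = torch.randn(2, 32, 400)
    print(ResidualBlock()(x).shape)    # torch.Size([2, 32, 400])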
[00343] [00343] Other implementations may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the actions of the system described above. Yet another implementation may include a method that performs the actions of the system described above.
[00344] [00344] Another system implementation of the disclosed technology includes a trained splice site predictor that runs on multiple processors operating in parallel and coupled to memory. The system trains an atrous convolutional neural network (abbreviated ACNN), which runs on the multiple processors, on at least 50,000 training examples of splice donor sites, at least 50,000 training examples of splice acceptor sites and at least 100,000 training examples of non-splicing sites. Each of the training examples used in the training is a nucleotide sequence that includes a target nucleotide flanked by at least 400 nucleotides on each side.
[00345] [00345] The system includes an ACNN input stage that runs on at least one of several processors and feeds an input sequence of at least 801 nucleotides for evaluation of the target nucleotides. Each target nucleotide is flanked by at least 400 nucleotides on each side. In other implementations, the system includes an ACNN input module that runs on at least one of several processors and feeds an input sequence of at least 801 nucleotides to assess target nucleotides.
[00346] [00346] The system includes an ACNN output stage that runs on at least one of the several processors and translates the analysis by the ACNN into classification scores for the probability that each of the target nucleotides is a splice donor site, a splice acceptor site or a non-splicing site. In other implementations, the system includes an ACNN output module that runs on at least one of the several processors and translates the analysis by the ACNN into classification scores for the probability that each of the target nucleotides is a splice donor site, a splice acceptor site or a non-splicing site.
[00347] [00347] Each of the features discussed in this specific implementation section for the first system implementation applies equally to this system implementation. As noted above, not all system features are repeated in this document, and they should be considered repeated by reference.
[00348] [00348] The ACNN can be trained on 150,000 training examples of splice donor sites, 150,000 training examples of splice acceptor sites and 800,000,000 training examples of non-splicing sites. In another implementation of the system, the ACNN comprises groups of residual blocks arranged in a sequence from smallest to largest. In yet another implementation, each group of residual blocks is parameterized by the number of convolution filters in the residual blocks, the convolution window size of the residual blocks and the atrous convolution rate of the residual blocks.
[00349] [00349] ACNN can comprise groups of residual blocks arranged in a sequence from smallest to largest. Each group of residual blocks is parameterized by a number of convolution filters in the residual blocks, a size of the convolution window of the residual blocks and an atrous convolution rate of the residual blocks.
[00350] [00350] In the ACNN, the atrous convolution rate progresses non-exponentially from a lower residual block group to a higher residual block group. Also in the ACNN, the convolution window size varies between groups of residual blocks.
[00351] [00351] ACNN can be trained on one or more training servers, as shown in FIGURE 18.
[00352] [00352] The trained ACNN can be deployed on one or more production servers that receive input sequences from requesting clients, as shown in FIGURE 18. In such an implementation, the production servers process the input sequences through the input and output stages of the ACNN to produce outputs that are transmitted to the clients, as shown in FIGURE 18. In other implementations, the production servers process the input sequences through the input and output modules of the ACNN to produce outputs that are transmitted to the clients, as shown in FIGURE 18.
[00353] [00353] Other implementations may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the actions of the system described above. Yet another implementation may include a method that performs the actions of the system described above.
[00354] [00354] A method implementation of the disclosed technology includes training a splice site detector that identifies splice sites in genomic sequences (for example, nucleotide sequences).
[00355] [00355] The method includes feeding, to an atrous convolutional neural network (abbreviated ACNN), an input sequence of at least 801 nucleotides for evaluation of target nucleotides, which are each flanked by at least 400 nucleotides on each side.
[00356] [00356] The ACNN is trained on at least 50,000 training examples of splice donor sites, at least 50,000 training examples of splice acceptor sites and at least 100,000 training examples of non-splicing sites. Each of the training examples used in the training is a nucleotide sequence that includes a target nucleotide flanked by at least 400 nucleotides on each side.
[00357] [00357] The method also includes translating the analysis by ACNN into classification scores for the probability that each of the target nucleotides is a splice donor site, a splice acceptor site or a non-splicing site.
[00358] [00358] Each of the features discussed in this specific implementation section for the first system implementation applies equally to this method implementation. As noted above, not all system features are repeated in this document, and they should be considered repeated by reference.
[00359] [00359] Other implementations may include a non-transitory computer-readable storage medium, storing instructions executable by a processor to perform actions of the method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in memory, to execute the method described above.
[00360] [00360] We describe systems, methods and articles of manufacture for using an atrous convolutional neural network trained to detect aberrant splicing in genomic sequences (for example, nucleotide sequences). One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. The omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated by reference into each of the following implementations.
[00361] [00361] A system implementation of the disclosed technology includes one or more processors coupled to memory. The memory is loaded with computer instructions to implement an aberrant splicing detector that runs on multiple processors operating in parallel and coupled to the memory.
[00362] [00362] As shown in FIGURE 34, the system includes a trained atrous convolutional neural network (abbreviated ACNN) that runs on the various processors. An ACNN is a convolutional neural network that uses atrous/dilated convolutions, which allow large receptive fields with few trainable parameters. An atrous/dilated convolution is a convolution in which the kernel is applied over an area greater than its length by skipping input values with a certain step, also called the atrous convolution rate or dilation factor. Atrous/dilated convolutions add spacing between the elements of a convolution filter/kernel, so that neighboring inputs (e.g., nucleotides, amino acids) at longer intervals are considered when a convolution operation is performed. This allows long-range contextual dependencies to be incorporated into the input. Atrous convolutions retain partial convolution calculations for reuse as adjacent nucleotides are processed.
[00363] [00363] As shown in FIGURE 34, the ACNN classifies the target nucleotides in an input sequence and assigns splice site scores for the probability that each of the target nucleotides is a splice donor site, a splice acceptor site or a non-splicing site. The input sequence comprises at least 801 nucleotides and each target nucleotide is flanked by at least 400 nucleotides on each side.
[00364] [00364] As shown in FIGURE 34, the system also includes a classifier, which runs on at least one of the several processors and processes a reference sequence and a variant sequence through the ACNN to produce splice site scores for the probability that each target nucleotide in the reference sequence and in the variant sequence is a splice donor site, a splice acceptor site or a non-splicing site. The reference sequence and the variant sequence each have at least 101 target nucleotides and each target nucleotide is flanked by at least 400 nucleotides on each side. FIGURE 33 depicts a reference sequence and an alternative/variant sequence.
[00365] [00365] As shown in FIGURE 34, the system determines, from the differences in the splice site scores of the target nucleotides in the reference sequence and in the variant sequence, whether a variant that generated the variant sequence causes aberrant splicing and is, therefore, pathogenic.
[00366] [00366] This implementation of the system and other disclosed systems optionally includes one or more of the following features. The system may also include features described in connection with the disclosed methods. For the sake of brevity, the alternative combinations of system features are not listed individually. The features applicable to systems, methods and articles of manufacture are not repeated for each statutory class set of basic features. The reader will understand how the features identified in this section can readily be combined with the basic features of other statutory classes.
[00367] [00367] As shown in FIGURE 34, the differences in the splice site scores can be determined position-wise between the target nucleotides in the reference sequence and in the variant sequence.
[00368] [00368] As shown in FIGURE 34, for at least one target nucleotide position, when a global maximum difference in the splice site scores is above a predetermined threshold, the ACNN classifies the variant as causing aberrant splicing and, therefore, as pathogenic.
[00369] [00369] As shown in FIGURE 17, for at least one target nucleotide position, when a global maximum difference in the splice site scores is above a predetermined threshold, the ACNN classifies the variant as causing aberrant splicing and, therefore, as pathogenic.
[00370] [00370] The threshold can be determined from a plurality of candidate thresholds. This includes processing a first set of reference and variant sequence pairs generated by common benign variants to produce a first set of aberrant splicing detections, processing a second set of reference and variant sequence pairs generated by rare pathogenic variants to produce a second set of aberrant splicing detections, and selecting, for use by the classifier, at least one threshold that maximizes the count of aberrant splicing detections in the second set and minimizes the count of aberrant splicing detections in the first set.
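A minimal Python sketch of this threshold selection is shown below; the score distributions and the single objective (detections among pathogenic pairs minus detections among benign pairs) are illustrative assumptions:

    import numpy as np

    def pick_threshold(benign_deltas, pathogenic_deltas, candidates):
        best, best_obj = None, -np.inf
        for t in candidates:
            hits_pathogenic = np.sum(pathogenic_deltas > t)  # maximize
            hits_benign = np.sum(benign_deltas > t)          # minimize
            obj = hits_pathogenic - hits_benign
            if obj > best_obj:
                best, best_obj = t, obj
        return best

    benign = np.random.beta(1, 20, size=1000)      # mostly small score changes
    pathogenic = np.random.beta(5, 5, size=200)    # larger score changes
    print(pick_threshold(benign, pathogenic, np.linspace(0, 1, 101)))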
[00371] [00371] In one implementation, ACNN identifies variants that cause autism spectrum disorder (abbreviated ASD). In another implementation, ACNN identifies variants that cause developmental delay disorder (abbreviated DDD).
[00372] [00372] As shown in FIGURE 36, the reference sequence and the variant sequence can each have at least 101 target nucleotides and each target nucleotide can be flanked by at least 5000 nucleotides on each side.
[00373] [00373] As shown in FIGURE 36, the splice site scores of the target nucleotides in the reference sequence can be encoded in a first output of the ACNN and the splice site scores of the target nucleotides in the variant sequence can be encoded in a second output of the ACNN. In one implementation, the first output is encoded as a first 101 x 3 matrix and the second output is encoded as a second 101 x 3 matrix.
[00374] [00374] As shown in FIGURE 36, in such an implementation, each row in the first 101 x 3 matrix represents the splice site scores for the probability that a target nucleotide in the reference sequence is a splice donor site, a splice acceptor site or a non-splicing site.
[00375] [00375] As shown in FIGURE 36, also in that implementation, each row in the second 101 x 3 matrix represents the splice site scores for the probability that a target nucleotide in the variant sequence is a splice donor site, a splice acceptor site or a non-splicing site.
[00376] [00376] As shown in FIGURE 36, in some implementations, the splice site scores in each row of the first 101 x 3 matrix and the second 101 x 3 matrix can be exponentially normalized to sum to unity.
[00377] [00377] As shown in FIGURE 36, the classifier can perform a row-by-row comparison of the first 101 x 3 matrix and the second 101 x 3 matrix and determine, on a row-by-row basis, changes in the distribution of the splice site scores. For at least one instance of the row-by-row comparison, when the change in distribution is above a predetermined threshold, the ACNN classifies the variant as causing aberrant splicing and, therefore, as pathogenic.
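A minimal Python sketch of this row-by-row comparison is shown below, assuming the two 101 x 3 matrices hold normalized scores; measuring the change in distribution as the maximum absolute per-row difference is an illustrative choice:

    import numpy as np

    def is_splice_altering(ref_matrix, alt_matrix, threshold=0.2):
        # ref_matrix, alt_matrix: (101, 3) splice site scores per target nucleotide.
        per_row_change = np.abs(alt_matrix - ref_matrix).max(axis=1)
        return bool((per_row_change > threshold).any()), float(per_row_change.max())

    ref = np.full((101, 3), [0.05, 0.05, 0.90])
    alt = ref.copy()
    alt[50] = [0.70, 0.05, 0.25]         # variant creates a donor site at row 50
    print(is_splice_altering(ref, alt))  # (True, 0.65)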
[00378] [00378] The system includes a one-hot encoder (shown in FIGURE 29) that sparsely encodes the reference sequence and the variant sequence.
[00379] [00379] Each of the features discussed in this specific implementation section for other system and method implementations applies equally to this system implementation. As noted above, not all system features are repeated in this document, and they should be considered repeated by reference.
[00380] [00380] Other implementations may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the actions of the system described above. Yet another implementation may include a method that performs the actions of the system described above.
[00381] [00381] A method implementation of the disclosed technology includes detecting genomic variants that cause aberrant splicing.
[00382] [00382] The method includes processing a reference sequence through an atrous convolutional neural network (abbreviated ACNN) trained to detect differential splicing patterns in a target sub-sequence of an input sequence by classifying each nucleotide in the target sub-sequence as a splice donor site, a splice acceptor site or a non-splicing site.
[00383] [00383] The method includes, based on the processing, detecting a first differential splicing pattern in a reference target sub-sequence by classifying each nucleotide in the reference target sub-sequence as a splice donor site, a splice acceptor site or a non-splicing site.
[00384] [00384] The method includes processing a variant sequence through the ACNN. The variant sequence and the reference sequence differ by at least one variant nucleotide located in a variant target sub-sequence.
[00385] [00385] The method includes, based on the processing, detecting a second differential splicing pattern in the variant target sub-sequence by classifying each nucleotide in the variant target sub-sequence as a splice donor site, a splice acceptor site or a non-splicing site.
[00386] [00386] The method includes determining a difference between the first differential splicing pattern and the second differential splicing pattern by comparing, on a nucleotide-by-nucleotide basis, the splice site classifications of the reference target sub-sequence and the variant target sub-sequence.
[00387] [00387] When the difference is above a predetermined threshold, the method includes classifying the variant as causing aberrant splicing and, therefore, as pathogenic, and storing the classification in memory.
[00388] [00388] Each of the features discussed in this specific implementation section for other system and method implementations applies equally to this method implementation. As noted above, not all system features are repeated in this document, and they should be considered repeated by reference.
[00389] [00389] A differential splicing pattern can identify the positional distribution of the occurrence of splicing events in a sub-sequence.
[00390] [00390] The reference target sub-sequence and the variant target sub-sequence can be aligned with respect to the nucleotide positions and may differ by at least one variant nucleotide.
[00391] [00391] The reference target sub-sequence and the variant target sub-sequence can each have at least 40 nucleotides and each can be flanked by at least 40 nucleotides on each side.
[00392] [00392] The reference target sub-sequence and the variant target sub-sequence can each have at least 101 nucleotides and each can be flanked by at least 5000 nucleotides on each side.
[00393] [00393] The variant target sub-sequence can include two variants.
[00394] [00394] Other implementations may include a non-transitory computer-readable storage medium, storing instructions executable by a processor to perform actions of the method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in memory, to execute the method described above.
[00395] [00395] We describe systems, methods and articles of manufacture for using a convolutional neural network trained to detect splice sites and aberrant splicing in genomic sequences (for example, nucleotide sequences). One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. The omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated by reference into each of the following implementations.
[00396] [00396] A system implementation of the disclosed technology includes one or more processors coupled to memory. The memory is loaded with computer instructions to train a splice site detector that identifies splice sites in genomic sequences (for example, nucleotide sequences).
[00397] [00397] The system trains a convolutional neural network (abbreviated CNN) on at least 50,000 training examples of splice donor sites, at least 50,000 training examples of splice acceptor sites and at least 100,000 training examples of non-splicing sites. Each training example is a target nucleotide sequence that has at least one target nucleotide flanked by at least a certain number of nucleotides on each side.
[00398] [00398] To evaluate a training example using the CNN, the system provides, as input to the CNN, a target nucleotide sequence flanked by at least 40 upstream context nucleotides and at least 40 downstream context nucleotides.
[00399] [00399] Based on the assessment, CNN then outputs triple scores for the probability that each nucleotide in the target nucleotide sequence is a splice donor site, a splice acceptor site or a non-splicing site.
[00400] [00400] This implementation of the system and other disclosed systems optionally includes one or more of the following features. The system may also include features described in connection with the disclosed methods. For the sake of brevity, the alternative combinations of system features are not listed individually. The features applicable to systems, methods and articles of manufacture are not repeated for each statutory class set of basic features. The reader will understand how the features identified in this section can readily be combined with the basic features of other statutory classes.
[00401] [00401] The input can comprise a target nucleotide sequence that has a target nucleotide flanked by 100 nucleotides on each side. In such an implementation, the target nucleotide sequence is further flanked by 200 upstream context nucleotides and 200 downstream context nucleotides.
[00402] [00402] As shown in FIGURE 28, the system can train the CNN on 150,000 training examples of splice donor sites, 150,000 training examples of splice acceptor sites and 1,000,000 training examples of non-splicing sites.
[00403] [00403] As shown in FIGURE 31, the CNN can be parameterized by a number of convolution layers, a number of convolution filters and a number of subsampling layers (for example, max pooling and average pooling).
[00404] [00404] As shown in FIGURE 31, CNN can include one or more fully connected layers and a terminal classification layer.
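The following is a minimal PyTorch sketch of a network with these ingredients (convolution layers, a max pooling subsampling layer, fully connected layers and a terminal 3-way classification layer); all layer sizes are illustrative assumptions:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv1d(4, 32, kernel_size=11, padding=5), nn.ReLU(),
        nn.Conv1d(32, 32, kernel_size=11, padding=5), nn.ReLU(),
        nn.MaxPool1d(2),                    # subsampling layer
        nn.Flatten(),
        nn.Linear(32 * 40, 64), nn.ReLU(),  # fully connected layer
        nn.Linear(64, 3),                   # donor / acceptor / non-splicing
    )

    x = torch.randn(8, 4, 80)               # batch of one-hot windows of length 80
    print(model(x).shape)                   # torch.Size([8, 3])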
[00405] [00405] The CNN can comprise dimensionality-altering convolution layers that reshape the spatial and feature dimensions of a previous input.
[00406] [00406] The triple scores for each nucleotide in the target nucleotide sequence can be exponentially normalized to sum to unity. In such an implementation, the system classifies each nucleotide in the target nucleotide sequence as the splice donor site, the splice acceptor site or the non-splicing site based on the highest of the respective triple scores.
[00407] [00407] As shown in FIGURE 32, the CNN evaluates the training examples batch-wise during an epoch. The training examples are randomly sampled into batches. Each batch has a predetermined size. The CNN repeats the evaluation of the training examples over multiple epochs (for example, 1-10).
[00408] [00408] The input may comprise a target nucleotide sequence that has two adjacent target nucleotides. The two adjacent target nucleotides can be adenine (abbreviated A) and guanine (abbreviated G). The two adjacent target nucleotides can be guanine (abbreviated G) and uracil (abbreviated U).
[00409] [00409] The system includes a one-hot encoder (shown in FIGURE 32) that sparsely encodes the training examples and provides one-hot encodings as input.
[00410] [00410] CNN can be parameterized by a number of residual blocks, a number of skip connections and a number of residual connections.
[00411] [00411] Each residual block comprises at least one batch normalization layer, at least one rectified linear unit (abbreviated ReLU) layer, at least one dimensionality-altering layer and at least one residual connection. In such an implementation, each residual block comprises two batch normalization layers, two ReLU non-linearity layers, two dimensionality-altering layers and one residual connection.
[00412] [00412] Each of the features discussed in this specific implementation section for other system and method implementations applies equally to this system implementation. As noted above, not all system features are repeated in this document, and they should be considered repeated by reference.
[00413] [00413] Other implementations may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the actions of the system described above. Yet another implementation may include a method that performs the actions of the system described above.
[00414] [00414] Another system implementation of the disclosed technology includes a trained splice site predictor that runs on multiple processors operating in parallel and coupled to memory. The system trains a convolutional neural network (abbreviated CNN), which runs on the multiple processors, on at least 50,000 training examples of splice donor sites, at least 50,000 training examples of splice acceptor sites and at least 100,000 training examples of non-splicing sites. Each of the training examples used in the training is a nucleotide sequence that includes a target nucleotide flanked by at least 400 nucleotides on each side.
[00415] [00415] The system includes a CNN input stage that runs on at least one of several processors and feeds an input sequence of at least 801 nucleotides for evaluation of the target nucleotides. Each target nucleotide is flanked by at least 400 nucleotides on each side. In other implementations, the system includes a CNN input module that runs on at least one of several processors and feeds an input sequence of at least 801 nucleotides for evaluation of the target nucleotides.
[00416] [00416] The system includes a CNN output stage that runs on at least one of the several processors and translates the analysis by the CNN into classification scores for the probability that each of the target nucleotides is a splice donor site, a splice acceptor site or a non-splicing site. In other implementations, the system includes a CNN output module that runs on at least one of the several processors and translates the analysis by the CNN into classification scores for the probability that each of the target nucleotides is a splice donor site, a splice acceptor site or a non-splicing site.
[00417] [00417] Each of the features discussed in this specific implementation section for other system and method implementations applies equally to this system implementation. As noted above, not all system features are repeated in this document, and they should be considered repeated by reference.
[00418] [00418] The CNN can be trained on 150,000 training examples of splice donor sites, 150,000 training examples of splice acceptor sites and 800,000,000 training examples of non-splicing sites.
[00419] [00419] CNN can be trained on one or more training servers.
[00420] [00420] The trained CNN can be deployed on one or more production servers that receive input sequences from requesting clients. In such an implementation, the production servers process the input sequences through the input and output stages of the CNN to produce outputs that are transmitted to the clients. In other implementations, the production servers process the input sequences through the input and output modules of the CNN to produce outputs that are transmitted to the clients.
[00421] [00421] Other implementations may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the actions of the system described above. Yet another implementation may include a method that performs the actions of the system described above.
[00422] [00422] A method implementation of the disclosed technology includes training a splice site detector that identifies splice sites in genomic sequences (for example, nucleotide sequences). The method includes feeding, to a convolutional neural network (abbreviated CNN), an input sequence of at least 801 nucleotides to evaluate the target nucleotides, which are each flanked by at least 400 nucleotides on each side.
[00423] [00423] The CNN is trained on at least 50,000 training examples of splice donor sites, at least 50,000 training examples of splice acceptor sites and at least 100,000 training examples of non-splicing sites. Each of the training examples used in the training is a nucleotide sequence that includes a target nucleotide flanked by at least 400 nucleotides on each side.
[00424] [00424] The method also includes translating the analysis by CNN into classification scores for the probability that each of the target nucleotides is a splice donor site, a splice acceptor site or a non-splicing site.
[00425] [00425] Each of the features discussed in this specific implementation section for other system and method implementations applies equally to this method implementation. As noted above, not all system features are repeated in this document, and they should be considered repeated by reference.
[00426] [00426] Other implementations may include a non-transitory computer-readable storage medium, storing instructions executable by a processor to perform actions of the method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in memory, to execute the method described above.
[00427] [00427] A system implementation of the disclosed technology includes one or more processors coupled to memory. The memory is loaded with computer instructions to implement an aberrant splicing detector that runs on multiple processors operating in parallel and coupled to the memory.
[00428] [00428] The system includes a trained convolutional neural network (abbreviated CNN) running on multiple processors.
[00429] [00429] As shown in FIGURE 34, the CNN classifies the target nucleotides in an input sequence and assigns splice site scores for the probability that each of the target nucleotides is a splice donor site, a splice acceptor site or a non-splicing site. The input sequence comprises at least 801 nucleotides and each target nucleotide is flanked by at least 400 nucleotides on each side.
[00430] [00430] As shown in FIGURE 34, the system also includes a classifier, which runs on at least one of the several processors and processes a reference sequence and a variant sequence through the CNN to produce splice site scores for the probability that each target nucleotide in the reference sequence and in the variant sequence is a splice donor site, a splice acceptor site or a non-splicing site. The reference sequence and the variant sequence each have at least 101 target nucleotides and each target nucleotide is flanked by at least 400 nucleotides on each side.
[00431] [00431] As shown in FIGURE 34, the system determines, from the differences in the splice site scores of the target nucleotides in the reference sequence and in the variant sequence, whether a variant that generated the variant sequence causes aberrant splicing and is, therefore, pathogenic.
[00432] [00432] Each of the features discussed in this specific implementation section for other system and method implementations applies equally to this system implementation. As noted above, not all system features are repeated in this document, and they should be considered repeated by reference.
[00433] [00433] Differences in the splice site scores can be determined position-wise between the target nucleotides in the reference sequence and in the variant sequence.
[00434] [00434] For at least one target nucleotide position, when a global maximum difference in the splice site scores is above a predetermined threshold, the CNN classifies the variant as causing aberrant splicing and, therefore, as pathogenic.
[00435] [00435] For at least one target nucleotide position, when a global maximum difference in the splice site scores is above a predetermined threshold, the CNN classifies the variant as causing aberrant splicing and, therefore, as pathogenic.
[00436] [00436] The threshold can be determined from a plurality of candidate thresholds. This includes processing a first set of reference and variant sequence pairs generated by common benign variants to produce a first set of aberrant splicing detections, processing a second set of reference and variant sequence pairs generated by rare pathogenic variants to produce a second set of aberrant splicing detections, and selecting, for use by the classifier, at least one threshold that maximizes the count of aberrant splicing detections in the second set and minimizes the count of aberrant splicing detections in the first set.
[00437] [00437] In one implementation, CNN identifies variants that cause autism spectrum disorder (abbreviated ASD). In another implementation, CNN identifies variants that cause developmental delay disorder (abbreviated DDD).
[00438] [00438] The reference sequence and the variant sequence each have at least 101 target nucleotides and each target nucleotide is flanked by at least 1000 nucleotides on each side.
[00439] [00439] The splice site scores of the target nucleotides in the reference sequence can be encoded in a first output of the CNN and the splice site scores of the target nucleotides in the variant sequence can be encoded in a second output of the CNN. In one implementation, the first output is encoded as a first 101 x 3 matrix and the second output is encoded as a second 101 x 3 matrix.
[00440] [00440] In such an implementation, each row in the first 101 x 3 matrix represents the splice site scores for the probability that a target nucleotide in the reference sequence is a splice donor site, a splice acceptor site or a non-splicing site.
[00441] [00441] Also in this implementation, each row in the second 101 x 3 matrix represents the splice site scores for the probability that a target nucleotide in the variant sequence is a splice donor site, a splice acceptor site or a non-splicing site.
[00442] [00442] In some implementations, the splice site scores in each row of the first 101 x 3 matrix and the second 101 x 3 matrix can be exponentially normalized to sum to unity.
[00443] [00443] The classifier can perform a row-by-row comparison of the first 101 x 3 matrix and the second 101 x 3 matrix and determine, on a row-by-row basis, changes in the distribution of the splice site scores. For at least one instance of the row-by-row comparison, when the change in distribution is above a predetermined threshold, the CNN classifies the variant as causing aberrant splicing and, therefore, as pathogenic.
[00444] [00444] The system includes a one-hot encoder (shown in FIGURE 29) that sparsely encodes the reference sequence and the variant sequence.
[00445] [00445] Other implementations may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the actions of the system described above. Yet another implementation may include a method that performs the actions of the system described above.
[00446] [00446] A method implementation of the disclosed technology includes detecting genomic variants that cause aberrant splicing.
[00447] [00447] The method includes processing a reference sequence through a convolutional neural network (abbreviated CNN) trained to detect differential splicing patterns in a target sub-sequence of an input sequence by classifying each nucleotide in the target sub-sequence as a splice donor site, a splice acceptor site or a non-splicing site.
[00448] [00448] The method includes, based on the processing, detecting a first differential splicing pattern in a reference target sub-sequence by classifying each nucleotide in the reference target sub-sequence as a splice donor site, a splice acceptor site or a non-splicing site.
[00449] [00449] The method includes processing a variant sequence through CNN. The variant sequence and the reference sequence differ in at least one variant nucleotide located in a variant target sub-sequence.
[00450] [00450] The method includes, based on the processing, detecting a second differential splicing pattern in the variant target sub-sequence by classifying each nucleotide in the variant target sub-sequence as a splice donor site, a splice acceptor site or a non-splicing site.
[00451] [00451] The method includes determining a difference between the first differential splicing pattern and the second differential splicing pattern by comparing, on a nucleotide-by-nucleotide basis, the splice site classifications of the reference target sub-sequence and the variant target sub-sequence.
[00452] [00452] When the difference is above a predetermined threshold, the method includes classifying the variant as causing aberrant splicing and, therefore, as pathogenic, and storing the classification in memory.
[00453] [00453] Each of the features discussed in this specific implementation section for other system and method implementations applies equally to this method implementation. As noted above, not all system features are repeated in this document, and they should be considered repeated by reference.
[00454] [00454] A differential splicing pattern can identify the positional distribution of the occurrence of splicing events in a sub-sequence.
[00455] [00455] The reference target sub-sequence and the variant target sub-sequence can be aligned with respect to the nucleotide positions and can differ by at least one variant nucleotide.
[00456] [00456] The reference target sub-sequence and the variant target sub-sequence can each have at least 40 nucleotides and each can be flanked by at least 40 nucleotides on each side.
[00457] [00457] The reference target sub-sequence and the variant target sub-sequence can each have at least 101 nucleotides and each can be flanked by at least 1000 nucleotides on each side.
[00458] [00458] The variant target sub-sequence can include two variants.
[00459] [00459] Other implementations may include a non-transitory computer-readable storage medium, storing instructions executable by a processor to perform actions of the method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in memory, to execute the method described above.
[00460] [00460] The preceding description is presented to enable the making and use of the disclosed technology. Various modifications to the disclosed implementations will be apparent, and the general principles defined in this document can be applied to other implementations and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed technology is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed in this document. The scope of the disclosed technology is defined by the appended claims.
[00461] [00461] FIGURE 57 depicts an implementation of gene enrichment analysis. In one implementation, the aberrant splicing detector is further configured to implement a gene enrichment analysis that determines the pathogenicity of variants that have been determined to cause aberrant splicing. For a particular gene sampled from a cohort of individuals with a genetic disorder, the gene enrichment analysis includes applying the trained ACNN to identify candidate variants in the particular gene that cause aberrant splicing, determining a reference number of mutations for the particular gene by summing the observed trinucleotide mutation rates of the candidate variants and multiplying the sum by a transmission count and a cohort size, applying the trained ACNN to identify de novo variants in the particular gene that cause aberrant splicing, and comparing the reference number of mutations with a count of the de novo variants. Based on the result of the comparison, the gene enrichment analysis determines that the particular gene is associated with the genetic disorder and that the de novo variants are pathogenic. In some implementations, the genetic disorder is autism spectrum disorder (abbreviated ASD). In other implementations, the genetic disorder is developmental delay disorder (abbreviated DDD).
[00462] [00462] In the example shown in FIGURE 57, five candidate variants in a particular gene have been classified as causing aberrant splicing by the aberrant splicing detector. The reference number of mutations for the particular gene is determined by summing the respective observed trinucleotide mutation rates of the five candidate variants and multiplying the sum by a transmission/chromosome count (2) and a cohort size (1000). The reference number is then compared with the de novo variant count (3).
[00463] [00463] In some implementations, the aberrant splicing detector is further configured to perform the comparison using a statistical test that produces a p-value as output.
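A minimal Python sketch of this gene enrichment comparison is shown below; the per-variant mutation rates are hypothetical placeholders and the Poisson test is an illustrative choice of statistical test:

    from scipy.stats import poisson

    rates = [1e-8, 1e-8, 1e-8, 1e-8, 1e-8]  # hypothetical trinucleotide mutation rates
    transmissions, cohort_size = 2, 1000
    expected = sum(rates) * transmissions * cohort_size  # reference number of mutations
    observed = 3                                         # de novo variant count

    # P(X >= observed) under a Poisson null with mean `expected`
    p_value = poisson.sf(observed - 1, expected)
    print(expected, p_value)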
[00464] [00464] In other implementations, the aberrant splicing detector is further configured to compare the reference number of mutations with the de novo variant count and, based on the result of the comparison, determine that the particular gene is not associated with the genetic disorder and that the de novo variants are benign.
[00465] [00465] In one implementation, at least some of the candidate variants are protein truncation variants.
[00466] [00466] In another implementation, at least some of the candidate variants are missense variants.
[00467] [00467] FIGURE 58 depicts an implementation of genome-wide enrichment analysis. In one implementation, the aberrant splicing detector is further configured to implement a genome-wide enrichment analysis that determines the pathogenicity of variants that have been determined to cause aberrant splicing. The genome-wide enrichment analysis includes applying the trained ACNN to identify a first set of de novo variants that cause aberrant splicing in a plurality of genes sampled from a cohort of healthy individuals, applying the trained ACNN to identify a second set of de novo variants that cause aberrant splicing in the plurality of genes sampled from a cohort of individuals with a genetic disorder, comparing the respective counts of the first and second sets, and, based on an output of the comparison, determining that the second set of de novo variants is enriched in the cohort of individuals with the genetic disorder and is therefore pathogenic. In some implementations, the genetic disorder is autism spectrum disorder (abbreviated ASD). In other implementations, the genetic disorder is developmental delay disorder (abbreviated DDD).
[00468] [00468] In some implementations, the aberrant splicing detector is further configured to perform the comparison using a statistical test that produces a p-value as output. In one implementation, the comparison is further parameterized by the respective cohort sizes.
[00469] [00469] In some implementations, the aberrant splicing detector is further configured to compare the respective counts of the first and second sets and, based on the output of the comparison, determine that the second set of de novo variants is not enriched in the cohort of individuals with the genetic disorder and is therefore benign.
[00470] [00470] In the example shown in FIGURE 58, the mutation rate in the healthy cohort (0.001) and the mutation rate in the affected cohort (0.004) are illustrated, together with the per-individual mutation ratio (4).
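A minimal Python sketch of the genome-wide comparison is shown below, using the rates from FIGURE 58; the cohort sizes and the use of Fisher's exact test on per-cohort variant counts are illustrative assumptions:

    from scipy.stats import fisher_exact

    healthy_n, affected_n = 1000, 1000            # assumed cohort sizes
    healthy_vars = round(0.001 * healthy_n)       # rates from FIGURE 58
    affected_vars = round(0.004 * affected_n)

    table = [[affected_vars, affected_n - affected_vars],
             [healthy_vars, healthy_n - healthy_vars]]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    # Per-individual mutation ratio: 0.004 / 0.001 = 4
    print(affected_vars / affected_n / (healthy_vars / healthy_n), p_value)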
[00471] [00471] Despite the limited diagnostic yield of exome sequencing in patients with severe genetic disorders, clinical sequencing has focused on rare coding mutations, largely disregarding variation in the non-coding genome due to the difficulty of interpretation. Here, we introduce a deep learning network that accurately predicts splicing from the primary nucleotide sequence, thereby identifying non-coding mutations that disrupt the normal patterning of exons and introns, with severe consequences for the resulting protein. We show that predicted cryptic splice mutations validate at high rates by RNA-seq, are highly deleterious in the human population and are a major cause of rare genetic diseases.
[00472] [00472] Using the deep learning network as an in silico model of the spliceosome, we were able to reconstruct the specificity determinants that allow the spliceosome to achieve its remarkable precision in vivo. We reaffirm many of the discoveries made over the last four decades of research on splicing mechanisms and show that the spliceosome integrates a large number of short-range and long-range specificity determinants into its decisions. In particular, we find that the perceived degeneracy of most splice motifs is explained by the presence of long-range determinants, such as exon/intron lengths and nucleosome positioning, which more than compensate and render additional specificity at the motif level unnecessary. Our findings demonstrate the promise of deep learning models to provide biological insights, rather than merely serving as black-box classifiers.
[00473] [00473] Deep learning is a relatively new technique in biology, and potential pitfalls remain. In learning to automatically extract features from the sequence, deep learning models may utilize novel sequence determinants that are not well described by human experts, but there is also the risk that the model incorporates features that do not reflect the true behavior of the spliceosome. These irrelevant features could increase the apparent accuracy of predicting annotated exon-intron boundaries, but would reduce the accuracy of predicting the splice-altering effects of arbitrary sequence changes induced by genetic variation. Because accurate variant prediction provides the strongest evidence that the model can generalize to true biology, we provide validation of predicted splice-altering variants using three fully orthogonal methods: RNA-seq, natural selection in human populations and de novo variant enrichment in case versus control cohorts. While this does not entirely rule out the incorporation of irrelevant features into the model, the resulting model appears faithful enough to the true biology of splicing to have significant value for practical applications, such as identifying cryptic splice mutations in patients with genetic diseases.
[00474] [00474] Compared to other classes of protein-truncating mutations, a particularly interesting aspect of cryptic splice mutations is the widespread phenomenon of alternative splicing due to splice-altering variants with incomplete penetrance, which tend to weaken canonical splice sites relative to alternative splice sites, resulting in the production of a mixture of aberrant and normal transcripts in the RNA-seq data. The observation that these variants often lead to tissue-specific alternative splicing highlights the unexpected role played by cryptic splice mutations in generating novel alternative splicing diversity. A possible future direction would be to train deep learning models on splice junction annotations from RNA-seq of the relevant tissue, thereby obtaining tissue-specific models of alternative splicing. Training the network on annotations derived directly from RNA-seq data also helps to fill gaps in the GENCODE annotations, which improves the model's performance in predicting variants (FIGURES 52A and 52B).
[00475] [00475] Our understanding of how mutations in the non-coding genome lead to human disease remains far from complete. The discovery of penetrant de novo cryptic splice mutations in childhood neurodevelopmental disorders demonstrates that improved interpretation of the non-coding genome can directly benefit patients with severe genetic disorders. Cryptic splice mutations also play important roles in cancer (Jung et al., 2015; Sanz et al., 2010; Supek et al., 2014), and recurrent somatic mutations in splice factors have been shown to produce widespread changes in splicing specificity (Graubert et al., 2012; Shirai et al., 2015; Yoshida et al., 2011). Much work remains to be done to understand the regulation of splicing in different tissues and cellular contexts, particularly in the case of mutations that directly impact proteins of the spliceosome. In light of recent advances in oligonucleotide therapies that could potentially target splicing defects in a sequence-specific manner (Finkel et al., 2017), a greater understanding of the regulatory mechanisms that govern this remarkable process could pave the way for new candidates for therapeutic intervention.
[00476] [00476] FIGURES 37A, 37B, 37C, 37D, 37E, 37F, 37G and 37H illustrate an implementation of splicing prediction from the primary sequence with deep learning.
[00477] [00477] In relation to FIGURE 37A, for each position in the pre-mRNA transcript, SpliceNet-10k uses 10,000 nucleotides of flanking sequence as input and predicts whether that position is a splice acceptor, a splice donor or neither.
[00478] [00478] In relation to FIGURE 37B, the full pre-mRNA transcript of the CFTR gene scored using MaxEntScan (top) and SpliceNet-10k (bottom) is shown, together with the predicted acceptor sites (red arrows) and donor sites (green arrows) and the actual exon positions (black boxes). For each method, we applied the threshold that made the number of predicted sites equal to the total number of real sites.
[00479] [00479] In relation to FIGURE 37C, for each exon, we measured the exon inclusion rate in the RNA-seq data and show the distribution of SpliceNet-10k scores for exons with different inclusion rates. The maximum acceptor and donor scores of each exon are shown.
[00480] [00480] In relation to FIGURE 37D, the impact of in silico mutation of each nucleotide around exon 9 in the U2SURP gene is shown. The vertical size of each nucleotide shows the decrease in the predicted strength of the acceptor site (black arrow) when that nucleotide is mutated (Δ score).
[00481] [00481] In relation to FIGURE 37E, the effect of the input sequence context size on network accuracy is shown. Top-k accuracy is the fraction of splice sites correctly predicted at the threshold where the number of predicted sites is equal to the actual number of sites present. PR-AUC is the area under the precision-recall curve. We also show the top-k accuracy and PR-AUC of three other splice site detection algorithms.
[00482] [00482] In relation to FIGURE 37F, the relationship between exon/intron length and the strength of the adjacent splice sites, as predicted by SpliceNet-80nt (local motif score) and SpliceNet-10k, is shown. The genome-wide distributions of exon length (yellow) and intron length (pink) are shown in the background. The x-axis is on a log scale.
[00483] [00483] In relation to FIGURE 37G, a pair of splice acceptor and donor motifs, placed 150 nt apart, is walked along the HMGCR gene. At each position, the K562 nucleosome signal and the pair's probability of forming an exon at that position, as predicted by SpliceNet-10k, are shown.
[00484] [00484] In relation to FIGURE 37H, the average K562 and GM12878 nucleosome signal near private mutations that are predicted by the SpliceNet-10k model to create new exons in the GTEx cohort is shown. The p-value by permutation test is shown.
[00485] [00485] FIGURES 38A, 38B, 38C, 38D, 38E, 38F, and 38G depict an implementation of the validation of rare cryptic splice mutations in the RNA-seq data.
[00486] [00486] In relation to FIGURE 38A, to assess the splice-altering impact of a mutation, SpliceNet-10k predicts acceptor and donor scores at each position in the pre-mRNA sequence of the gene with and without the mutation, as shown in this document for rs397515893, a pathogenic cryptic splice variant in an MYBPC3 intron associated with cardiomyopathy. The Δ score value for the mutation is the largest change in the splice prediction scores within 50 nt of the variant.
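A minimal Python sketch of this Δ score computation is shown below; the `model` callable returning an (L, 3) array of acceptor/donor/non-splicing scores is a hypothetical stand-in for SpliceNet-10k, and the sketch assumes a single-nucleotide variant so the two sequences have equal length:

    import numpy as np

    def delta_score(model, ref_seq, alt_seq, variant_pos, window=50):
        ref_scores = model(ref_seq)   # (L, 3) splice prediction scores
        alt_scores = model(alt_seq)
        lo = max(0, variant_pos - window)
        hi = min(len(ref_seq), variant_pos + window + 1)
        # Largest change in any splice prediction score within 50 nt of the variant.
        return float(np.abs(alt_scores[lo:hi] - ref_scores[lo:hi]).max())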
[00487] [00487] In relation to FIGURE 38B, we scored private genetic variants (observed in only one of the 149 individuals in the GTEx cohort) with the SpliceNet-10k model. The enrichment of private variants predicted to alter splicing (Δ score > 0.2, blue) or to have no effect on splicing (Δ score < 0.01, red) in the vicinity of private exon-skipping junctions (top) or private acceptor and donor sites (bottom) is shown. The y-axis shows the number of times that a private splice event and a nearby private genetic variant co-occur in the same individual, compared with the expected numbers obtained through permutations.
[00488] [00488] In relation to FIGURE 38C, an example of a synonymous heterozygous variant in PYGB that creates a new donor site with incomplete penetrance is shown. The RNA-seq coverage, the junction read counts and the junction positions (blue and gray arrows) are shown for the individual with the variant and a control individual. The effect size is calculated as the difference in the usage of the new junction between individuals with the variant and individuals without the variant. In the stacked bar chart below, we show the number of reads with the reference or alternative allele that used the annotated junction or the new junction ("no splicing" and "new junction", respectively). The total number of reference reads differed significantly from the total number of alternative reads (P = 0.018, binomial test), suggesting that 60% of the transcripts spliced at the new junction are absent from the RNA-seq data, probably due to nonsense-mediated decay (NMD).
[00489] [00489] In relation to FIGURE 38D, the fraction of cryptic splice mutations predicted by the SpliceNet-10k model that were validated against the GTEx RNA-seq data. The validation rate of disruptions of essential acceptor or donor dinucleotides (dashed line) is less than 100% due to limited coverage and nonsense-mediated decay.
[00490] [00490] In relation to FIG. 38E, distribution of effect sizes for validated cryptic splice predictions. The dashed line (50%) corresponds to the expected effect size of fully penetrant heterozygous variants. The measured effect size of disruptions of essential acceptor or donor dinucleotides is less than 50% due to nonsense-mediated decay or unaccounted-for isoform changes.
[00491] [00491] In relation to FIG. 38F, sensitivity of SpliceNet-10k in detecting private splice-altering variants in the GTEx cohort at different Δ score cutoffs. Variants are divided into deep intronic variants (> 50 nt from exons) and variants near exons (overlapping exons or within 50 nt of exon-intron boundaries).
[00492] [00492] In relation to FIG. 38G, validation rate and sensitivity of SpliceNet-10k and three other splice site prediction methods at different confidence cutoffs. The three points on the SpliceNet-10k curve show the performance of SpliceNet-10k at Δ score cutoffs of 0.2, 0.5 and 0.8. For the other three algorithms, the three points on the curve indicate their performance at the thresholds where they predict the same number of cryptic splice variants as SpliceNet-10k does at Δ score cutoffs of 0.2, 0.5 and 0.8.
[00493] [00493] FIGURES 39A, 39B and 39C represent an implementation of cryptic splice variants that frequently create tissue-specific alternative splicing.
[00494] [00494] In relation to FIG. 39A, example of an exonic heterozygous variant in CDC25B that creates a new donor site. The variant is private to a single individual in the GTEx cohort and exhibits tissue-specific alternative splicing, with a greater fraction of the new splice isoform in muscle compared to fibroblasts (P = 0.006 by Fisher's exact test). The RNA-seq coverage, the junction read counts and the junction positions (blue and gray arrows) are shown for the individual with the variant and for a control individual, in both muscle and fibroblasts.
[00495] [00495] In relation to FIG. 39B, example of a heterozygous exonic acceptor-creating variant in FAM229B that exhibits consistent tissue-specific effects in all three individuals in the GTEx cohort harboring the variant. The RNA-seq coverage for artery and lung is shown for the three individuals with the variant and for a control individual.
[00496] [00496] In relation to FIG. 39C, fraction of splice site-creating variants in the GTEx cohort that are associated with significantly non-uniform usage of the new junction across expressing tissues, assessed by the chi-square test for homogeneity. Validated cryptic splice variants with low to intermediate Δ scores were more likely to result in tissue-specific alternative splicing (P = 0.015, Fisher's exact test).
[00497] [00497] FIGURES 40A, 40B, 40C, 40D and 40E depict an implementation of predicted cryptic splice variants that are strongly deleterious in human populations.
[00498] [00498] In relation to FIG. 40A, synonymous and intronic variants (within 50 nt of known exon-intron boundaries and excluding essential GT or AG dinucleotides) with predicted splice-altering effects (Δ score > 0.8) are strongly depleted at common allele frequencies (> 0.1%) in the human population relative to rare variants observed only once among 60,706 individuals. The odds ratio of 4.58 (P < 10^-2 by the chi-square test) indicates that 78% of recently arisen predicted cryptic splice variants are sufficiently deleterious to be removed by natural selection.
[00499] [00499] In relation to FIG. 40B, fraction of protein-truncating variants and of predicted synonymous and intronic cryptic splice variants in the ExAC data set that are deleterious, calculated as in (A).
[00500] [00500] In relation to FIG. 40C, fraction of synonymous and intronic cryptic splice gain variants in the ExAC data set that are deleterious (Δ score > 0.8), split based on whether or not the variant is predicted to cause a frameshift.
[00501] [00501] In relation to FIG. 40D, fraction of predicted protein-truncating variants and deep intronic cryptic splice variants (> 50 nt from known exon-intron boundaries) in the gnomAD data set that are deleterious.
[00502] [00502] In relation to FIG. 40E, average number of rare (allele frequency < 0.1%) protein-truncating variants and rare functional cryptic splice variants per individual human genome. The number of cryptic splice mutations expected to be functional is estimated based on the fraction of predictions that are deleterious. The total number of predictions is higher.
[00503] [00503] FIGURES 41A, 41B, 41C, 41D, 41E and 41F represent an implementation of de novo cryptic splice mutations in patients with rare genetic disease.
[00504] [00504] In relation to FIG. 41A, de novo cryptic splice mutations per person for patients in the Deciphering Developmental Disorders (DDD) cohort, individuals with autism spectrum disorders (ASD) from the Simons Simplex Collection and the Autism Sequencing Consortium, as well as healthy controls. Enrichment in the DDD and ASD cohorts is shown relative to healthy controls, adjusted for variant ascertainment between cohorts. Error bars show 95% confidence intervals.
[00505] [00505] In relation to FIG. 41B, estimated proportion of de novo pathogenic mutations by functional category for the DDD and ASD cohorts.
[00506] [00506] In relation to FIG. 41C, enrichment and excess of de novo cryptic splice mutations in the DDD and ASD cohorts compared to healthy controls at different Δ score thresholds.
[00507] [00507] In relation to FIG. 41D, list of novel candidate disease genes enriched for de novo mutations in the DDD and ASD cohorts (FDR < 0.01) when predicted cryptic splice mutations were included together with protein-coding mutations in the enrichment analysis. Phenotypes present in multiple individuals are shown.
[00508] [00508] In relation to FIG. 41E, three examples of predicted de novo cryptic splice mutations in patients with autism that were validated by RNA-seq, resulting in intron retention, exon skipping and exon extension, respectively. For each example, the RNA-seq coverage and junction counts for the affected individual are shown at the top, and a control individual without the mutation is shown at the bottom. Sequences are shown on the sense strand relative to the transcription of the gene. The blue and gray arrows mark the positions of the junctions in the individual with the variant and in the control individual, respectively.
[00509] [00509] In relation to FIG. 41F, validation status of 36 predicted cryptic splice sites selected for experimental validation by RNA-seq. EXPERIMENTAL MODEL AND DETAILS OF THE SUBJECTS
[00510] [00510] The details of the subjects for the 36 patients with autism were previously disclosed by Iossifov et al., Nature 2014 (Table S1) and can be cross-referenced using the anonymized identifiers in Column 1 of Table S4 of our article. METHOD DETAILS
[00511] [00511] We trained several models based on ultra-deep convolutional neural networks to computationally predict splicing from the pre-mRNA nucleotide sequence. We designed four architectures, namely SpliceNet-80nt, SpliceNet-400nt, SpliceNet-2k and SpliceNet-10k, which use 40, 200, 1,000 and 5,000 nucleotides on each side of a position of interest as input, respectively, and output the probability that the position is a splice acceptor or donor. More precisely, the input to the models is a sequence of one-hot encoded nucleotides, where A, C, G and T (or equivalently U) are encoded as [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0] and [0, 0, 0, 1], respectively, and the output of the models consists of three scores that sum to one, corresponding to the probability that the position of interest is a splice acceptor, a splice donor, or neither.
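For illustration, this input encoding can be sketched in Python as follows (the helper name and structure are illustrative, not part of the disclosure):

    import numpy as np

    # A, C, G and T (or equivalently U) map to the unit vectors
    # [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0] and [0, 0, 0, 1].
    NUCLEOTIDE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'U': 3}

    def one_hot_encode(sequence):
        encoding = np.zeros((len(sequence), 4), dtype=np.float32)
        for i, base in enumerate(sequence):
            encoding[i, NUCLEOTIDE_INDEX[base]] = 1.0
        return encoding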
[00512] [00512] The basic unit of the SpliceNet architectures is a residual block (He et al., 2016b), which consists of batch normalization layers (Ioffe and Szegedy, 2015), rectified linear units (ReLU) and convolutional units organized in a specific manner (FIGS. 21, 22, 23 and 24). Residual blocks are commonly used when designing deep neural networks. Before the development of residual blocks, deep neural networks consisting of many convolutional units stacked one after the other were very difficult to train due to the exploding/vanishing gradient problem (Glorot and Bengio, 2010), and increasing the depth of such neural networks often resulted in a higher training error (He et al., 2016a). Through a comprehensive set of computational experiments, architectures consisting of many residual blocks stacked one after the other were shown to overcome these problems (He et al., 2016a).
[00513] [00513] The complete SpliceNet architectures are provided in FIGs. 21, 22, 23 and 24. The architectures consist of K stacked residual blocks connecting the input layer to the penultimate layer, and a convolutional unit with softmax activation connecting the penultimate layer to the output layer. The residual blocks are stacked so that the output of the i-th residual block is connected to the input of the (i + 1)-th residual block. In addition, the output of every fourth residual block is added to the input of the penultimate layer. These "skip connections" are commonly used in deep neural networks to increase the speed of convergence during training (Oord et al., 2016).
[00514] [00514] Each residual block has three hyperparameters N, W and D, where N denotes the number of convolutional kernels, W the window size and D the dilation rate (Yu and Koltun, 2016) of each convolutional kernel. Since a convolutional kernel of window size W and dilation rate D extracts features spanning (W - 1)D neighboring positions, a residual block with hyperparameters N, W and D extracts features spanning 2(W - 1)D neighboring positions. Hence, the total neighbor span of the SpliceNet architectures is given by S = Σ_{i=1}^{K} 2(W_i - 1)D_i, where N_i, W_i and D_i are the hyperparameters of the i-th residual block. For the SpliceNet-80nt, SpliceNet-400nt, SpliceNet-2k and SpliceNet-10k architectures, the number of residual blocks and the hyperparameters of each residual block were chosen so that S is equal to 80, 400, 2,000 and 10,000, respectively.
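For illustration, the span formula above can be computed as follows; the example hyperparameter values are placeholders rather than the disclosed configurations:

    # Total neighbor span S = sum over residual blocks of 2 * (W - 1) * D.
    def receptive_span(blocks):
        # blocks: list of (W, D) pairs, one per residual block.
        return sum(2 * (w - 1) * d for w, d in blocks)

    # Hypothetical example: four residual blocks with window size 11 and
    # dilation rate 1 give S = 4 * 2 * (11 - 1) * 1 = 80 nucleotides.
    assert receptive_span([(11, 1)] * 4) == 80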
[00515] [00515] Besides convolutional units, the SpliceNet architectures contain only normalization and non-linear activation units. Consequently, the models can be used in a sequence-to-sequence mode with variable sequence length (Oord et al., 2016). For example, the input to the SpliceNet-10k model (S = 10,000) can be a nucleotide sequence of arbitrary length, for which the model outputs acceptor and donor scores at every position.
[00516] [00516] Our models adopted the residual block architecture, which has become widely used owing to its success in image classification. Residual blocks comprise repeated convolution units interspersed with skip connections that allow information from earlier layers to bypass the residual blocks. In each residual block, the input layer is first batch-normalized, followed by an activation layer using rectified linear units (ReLU). The activation is then passed through a one-dimensional convolution layer. This intermediate output of the one-dimensional convolution layer is again batch-normalized and ReLU-activated, followed by another one-dimensional convolution layer. At the end of the second one-dimensional convolution, we add its output to the original input of the residual block, which acts as a skip connection, allowing information from the original input to bypass the residual block. In such an architecture, called a deep residual learning network by its authors, the input is preserved in its original state and the residual connections are kept free of the model's non-linear activations, allowing effective training of deeper networks.
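A minimal sketch of such a residual block in Keras, assuming the pre-activation ordering described above (the function and argument names are illustrative):

    from tensorflow.keras import layers

    def residual_block(x, n_filters, window_size, dilation_rate):
        # Pre-activation residual block: BN -> ReLU -> Conv1D, twice,
        # then add the block input back as a skip connection. The input
        # is assumed to already have n_filters channels so that the
        # addition is well defined.
        shortcut = x
        y = layers.BatchNormalization()(x)
        y = layers.Activation('relu')(y)
        y = layers.Conv1D(n_filters, window_size,
                          dilation_rate=dilation_rate, padding='same')(y)
        y = layers.BatchNormalization()(y)
        y = layers.Activation('relu')(y)
        y = layers.Conv1D(n_filters, window_size,
                          dilation_rate=dilation_rate, padding='same')(y)
        return layers.Add()([shortcut, y])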
[00517] [00517] After the residual blocks, the softmax layer computes the probabilities of the three states for each amino acid, and the state with the highest softmax probability determines the state of the amino acid. The model is trained with the categorical cross-entropy loss accumulated over the entire protein sequence, using the ADAM optimizer.
[00518] [00518] Atrous/dilated convolutions allow large receptive fields with few trainable parameters. An atrous/dilated convolution is a convolution in which the kernel is applied over an area larger than its length by skipping input values with a certain step, also called the atrous convolution rate or dilation factor. Atrous/dilated convolutions add spacing between the elements of a convolution filter/kernel, so that neighboring inputs (e.g., nucleotides, amino acids) at larger intervals are considered when a convolution operation is performed. This enables the incorporation of long-range contextual dependencies into the input. Atrous convolutions retain partial convolution calculations for reuse as adjacent nucleotides are processed.
[00519] [00519] The illustrated example uses one-dimensional convolutions. In other implementations, the model can use different types of convolutions, such as two-dimensional convolutions, three-dimensional convolutions, dilated or atrous convolutions, transposed convolutions, separable convolutions and depthwise separable convolutions. Some layers also use the ReLU activation function, which greatly accelerates the convergence of stochastic gradient descent compared to saturating non-linearities such as the sigmoid or hyperbolic tangent. Other examples of activation functions that can be used by the disclosed technology include parametric ReLU, leaky ReLU and the exponential linear unit (ELU).
[00520] [00520] Some layers also use batch normalization (Ioffe and Szegedy, 2015). With respect to batch normalization, the distribution of each layer in a convolutional neural network (CNN) changes during training and varies from one layer to another. This reduces the convergence speed of the optimization algorithm. Batch normalization is a technique to overcome this problem. Denoting the input of a batch normalization layer with x and its output with z, batch normalization applies the following transformation to x:

z = γ · (x - μ) / σ + β
[00521] [00521] Batch normalization applies mean-variance normalization to the input x using μ and σ, and linearly scales and shifts it using γ and β. The normalization parameters μ and σ are calculated for the current layer over the training set using a method called exponential moving average. In other words, they are not trainable parameters. In contrast, γ and β are trainable parameters. The values of μ and σ calculated during training are used in the forward pass during inference.
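A minimal sketch of this transformation (eps is a small constant added for numerical stability, an implementation detail rather than part of the formula above):

    import numpy as np

    def batch_norm(x, gamma, beta, mu, sigma, eps=1e-5):
        # z = gamma * (x - mu) / sigma + beta; mu and sigma are running
        # exponential-moving-average statistics (not trainable), while
        # gamma and beta are trainable parameters.
        return gamma * (x - mu) / (sigma + eps) + beta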
[00522] [00522] We downloaded the GENCODE (Harrow et al., 2012) V24lift37 gene annotation table from the UCSC table browser and extracted the canonical transcript annotations.
[00523] [00523] We used the following procedure to train and test the models in a sequence-to-sequence mode with chunks of size 5,000. For each gene, the mRNA transcript sequence between the canonical transcription start and end sites was extracted from the hg19/GRCh37 assembly. The input mRNA transcript sequence was one-hot encoded as follows: A, C, G, T/U mapped to [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], respectively. The one-hot encoded nucleotide sequence was zero-padded until its length became a multiple of 5,000, and then zero-padded at the beginning and at the end with flanking sequences of length S/2, where S is equal to 80, 400, 2,000 and 10,000 for the SpliceNet-80nt, SpliceNet-400nt, SpliceNet-2k and SpliceNet-10k models, respectively. The padded nucleotide sequence was then divided into blocks of length S/2 + 5,000 + S/2, in such a way that the i-th block consisted of nucleotide positions 5,000(i - 1) - S/2 + 1 through 5,000i + S/2. Likewise, the splicing output label sequence was one-hot encoded as follows: not a splice site, splice acceptor (first nucleotide of the corresponding exon) and splice donor (last nucleotide of the corresponding exon) were mapped to [1, 0, 0], [0, 1, 0] and [0, 0, 1], respectively. The one-hot encoded output label sequence was zero-padded until its length became a multiple of 5,000 and then divided into blocks of length 5,000, such that the i-th block consisted of positions 5,000(i - 1) + 1 through 5,000i. The one-hot encoded nucleotide sequence and the corresponding one-hot encoded label sequence were used as the model inputs and the model target outputs, respectively.
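A minimal sketch of this padding and chunking scheme, assuming the one-hot encoded sequence from the earlier sketch (names are illustrative):

    import numpy as np

    def make_blocks(one_hot_seq, flank):
        # flank = S / 2. Pad to a multiple of 5,000, add S/2 zeros on each
        # side, then cut blocks of length S/2 + 5,000 + S/2 with a stride
        # of 5,000, as described above.
        chunk = 5000
        n = len(one_hot_seq)
        padded_len = -(-n // chunk) * chunk  # round up to a multiple of 5,000
        x = np.zeros((flank + padded_len + flank, 4), dtype=np.float32)
        x[flank:flank + n] = one_hot_seq
        return [x[i * chunk: i * chunk + 2 * flank + chunk]
                for i in range(padded_len // chunk)]

    # For SpliceNet-10k, S = 10,000, so flank = 5,000.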
[00524] [00524] The models were trained for 10 epochs with a batch size of 12 on two NVIDIA GeForce GTX 1080 Ti GPUs. The categorical cross-entropy loss between the target and predicted outputs was minimized using the Adam optimizer (Kingma and Ba, 2015) during training. The learning rate of the optimizer was set to 0.001 for the first 6 epochs and then halved in every subsequent epoch. For each architecture, we repeated the training procedure 5 times and obtained 5 trained models (FIGS. 53A and 53B). During testing, each input was evaluated using all 5 trained models, and the average of their outputs was used as the predicted output. We used these models for the analyses in FIG. 37A and other related figures.
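A sketch of the learning rate schedule described above, using a standard Keras callback (the surrounding training code is assumed):

    from tensorflow.keras.callbacks import LearningRateScheduler

    def lr_schedule(epoch):
        # 0.001 for the first 6 epochs (0-5), then halved every epoch.
        return 0.001 if epoch < 6 else 0.001 / (2 ** (epoch - 5))

    # model.fit(inputs, targets, epochs=10, batch_size=12,
    #           callbacks=[LearningRateScheduler(lr_schedule)])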
[00525] [00525] For the analyses in FIGs. 38A-G, 39A-C, 40A-E and 41A-F involving the identification of splice-altering variants, we augmented the GENCODE annotation training set to also include novel splice junctions commonly seen in the GTEx cohort on chromosomes 2, 4, 6, 8, 10-22, X and Y (67,012 splice donors and 62,911 splice acceptors). This increased the number of splice junction annotations in the training set by ~50%. Training the network on the combined data set improved the sensitivity of detecting splice-altering variants in the RNA-seq data compared to the network trained only on the GENCODE annotations (FIGS. 52A and 52B), particularly for predicting deep intronic splice variants, and we used this network for the analyses involving variant evaluation (FIGs. 38A-G, 39A-C, 40A-E and 41A-F and related figures). To ensure that the GTEx RNA-seq data set did not contain overlap between training and evaluation, we only included junctions present in 5 or more individuals in the training data set, and only evaluated the performance of the network on variants present in 4 or fewer individuals. Details of the novel splice junction identification are described in "Detection of splice junctions" in the GTEx analysis section of the methods.
[00526] [00526] A precision metric such as the percentage of correctly classified positions is largely ineffective, because the vast majority of positions are not splice sites. Instead, we evaluate the models using two metrics that are effective in such settings, namely top-k accuracy and area under the precision-recall curve. The top-k accuracy of a specific class is defined as follows: suppose the test set has k positions belonging to the class. We choose the threshold so that exactly k positions of the test set are predicted to belong to the class. The fraction of these k predicted positions that truly belong to the class is reported as the top-k accuracy. Indeed, this is equal to the precision when the threshold is chosen so that precision and recall have the same value.
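A minimal sketch of the top-k accuracy computation described above:

    import numpy as np

    def top_k_accuracy(scores, labels):
        # labels: boolean array marking the positions that truly belong
        # to the class; k is the number of such positions. Take the k
        # highest-scoring positions and report the fraction that are
        # true members of the class.
        k = int(labels.sum())
        top_k_idx = np.argsort(scores)[-k:]
        return labels[top_k_idx].mean()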
[00527] [00527] We obtained a list of all lincRNA transcripts based on the GENCODE V24lift37 annotations. Unlike protein-coding genes, lincRNAs are not assigned a principal transcript in the GENCODE annotations. To minimize redundancy in the validation set, we identified the transcript with the largest total exonic sequence per lincRNA gene and called it the canonical transcript for the gene. Since lincRNA annotations are expected to be less reliable than the annotations of protein-coding genes, and such annotations would affect our top-k accuracy estimates, we used GTEx data to eliminate lincRNAs with possible annotation problems (see the section "Analyses in the GTEx data set" below for details on these data). For each lincRNA, we counted all split reads mapped across the length of the lincRNA across all GTEx samples (see "Junction detection" below for details). This was an estimate of the total number of junction-spanning reads for the lincRNA that used annotated or novel junctions. We also counted the number of reads spanning the junctions of the canonical transcript. We considered only lincRNAs for which at least 95% of the junction-spanning reads across all GTEx samples corresponded to the canonical transcript. We also required that all junctions of the canonical transcript be observed at least once in the GTEx cohort (excluding junctions spanning introns of length < 10 nt). To calculate the top-k accuracy, we considered only the junctions of the canonical transcripts of the lincRNAs that passed the above filters (781 transcripts, 1,047 junctions).
[00528] [00528] In FIG. 37B, we compared the performance of MaxEntScan and SpliceNet-10k with respect to identifying the canonical exon boundaries of a gene from its sequence. We used the CFTR gene, which is in our test set and has 26 canonical splice acceptors and donors, as a case study, and obtained an acceptor and a donor score for each of the 188,703 positions from the canonical transcription start site (chr7:117,120,017) to the canonical transcription end site (chr7:117,308,719) using MaxEntScan and SpliceNet-10k. A position was classified as a splice acceptor or donor if its corresponding score was greater than the threshold chosen when assessing the top-k accuracy. MaxEntScan predicted 49 splice acceptors and 22 splice donors, of which 9 and 5 are true acceptors and donors, respectively. For better visualization, we show MaxEntScan's pre-log scores (capped at a maximum of 2,500). SpliceNet-10k predicted 26 splice acceptors and 26 splice donors, all of them correct. For FIG. 42B, we repeated the analysis using the LINC00467 gene.
[00529] [00529] We calculated the inclusion rate of all GENCODE-annotated exons from the GTEx RNA-seq data (FIG. 37C). For each exon, excluding the first and last exons of each gene, we calculated the inclusion rate as:

inclusion rate = ( (L + R) / 2 ) / ( S + (L + R) / 2 )

[00530] [00530] where L is the total read count for the junction from the previous canonical exon to the exon under consideration across all GTEx samples, R is the corresponding count for the junction from the exon under consideration to the next canonical exon, and S is the count of reads that skip the exon.
[00531] [00531] In FIG. 37D, we identified the nucleotides that are considered important by SpliceNet-10k for classifying a position as a splice acceptor. For this purpose, we considered the splice acceptor at chr3:142,740,192 in the U2SURP gene, which is in our test set. The "importance score" of a nucleotide with respect to a splice acceptor is defined as follows: let s_ref denote the score of the splice acceptor under consideration. The acceptor score is recalculated by replacing the nucleotide under consideration with A, C, G and T. Let these scores be denoted s_A, s_C, s_G and s_T, respectively. The importance score of the nucleotide is estimated as:

s_ref - ( s_A + s_C + s_G + s_T ) / 4
[00532] [00532] This procedure is generally called in-silico mutagenesis (Zhou and Troyanskaya, 2015). We plotted the 127 nucleotides from chr3:142,740,137 to chr3:142,740,263 in such a way that the height of each nucleotide is its importance score with respect to the splice acceptor at chr3:142,740,192. The plotting function was adapted from the DeepLIFT software (Shrikumar et al., 2017).
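A minimal sketch of this in-silico mutagenesis procedure (score_fn stands in for a trained SpliceNet-10k model and is an assumption of the sketch):

    import numpy as np

    BASES = ['A', 'C', 'G', 'T']

    def importance_score(score_fn, sequence, position):
        # s_ref minus the average of the four substituted scores, as in
        # the formula above.
        s_ref = score_fn(sequence)
        substituted = [score_fn(sequence[:position] + base
                                + sequence[position + 1:])
                       for base in BASES]
        return s_ref - np.mean(substituted)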
[00533] [00533] To study the impact of the position of the branch point sequence on acceptor strength, we first obtained the acceptor scores of the 14,289 splice acceptors in the test set using SpliceNet-10k. Let y_ref denote the vector containing these scores. For each value of i from 0 to 100, we did the following: for each splice acceptor in the test set, we replaced the nucleotides at positions i through i + 6 upstream of the splice acceptor with TACTAAC and recalculated the acceptor score using SpliceNet-10k. Let y_alt,i denote the vector containing these scores. In FIG. 43A, we plot the following quantity as a function of i:

mean( y_alt,i - y_ref )
[00534] [00534] For FIG. 43B, we repeated the same procedure using the GAAGAA SR-protein motif. In this case, we also studied the impact of the motif when present after the splice acceptor, as well as its impact on donor strength. GAAGAA and TACTAAC were the motifs with the greatest impact on acceptor and donor strength, based on a comprehensive search of the k-mer space.
[00535] [00535] To study the effect of exon length on splicing, we filtered out the exons in the test set that were the first or the last exon of their gene. This filtering step removed 1,652 of the 14,289 exons. We sorted the remaining 12,637 exons in order of increasing length. For each of them, we calculated a splicing score by averaging the acceptor score at the splice acceptor site and the donor score at the splice donor site using SpliceNet-80nt. We plot the splicing scores as a function of exon length in FIG. 37F. Before plotting, we applied the following smoothing procedure: let x denote the vector containing the exon lengths and y the vector containing their corresponding splicing scores. We smoothed both x and y using a running-average window of size 2,500.
[00536] [00536] We repeated this analysis, calculating the splicing scores using SpliceNet-10k. In the background, we show the histogram of the lengths of the 12,637 exons considered in this analysis. We applied a similar analysis to study the effect of intron length on splicing, the main difference being that it was not necessary to exclude the first and last exons.
[00537] [00537] We downloaded the nucleosome data for the K562 cell line from the UCSC genome browser. We used the HMGCR gene, which is in our test set, as an anecdotal example to demonstrate the impact of nucleosome positioning on the SpliceNet-10k score. For each position p in the gene, we calculated its "planted splicing score" as follows: the 8 nucleotides from positions p + 74 to p + 81 were replaced by an AGGTAAGG donor motif.
[00538] [00538] The K562 nucleosome signal, as well as the planted splicing score, for the 5,000 positions from chr5:74,652,154 to chr5:74,657,153 is shown in FIG. 37G.
[00539] [00539] To calculate the genome-wide Spearman correlation between these two tracks, we randomly chose one million intergenic positions that were at least 100,000 nt away from all canonical genes. For each of these positions, we calculated its planted splicing score as well as its average K562 nucleosome signal (a window of size 50 was used for the average). The correlation between these two values across the one million positions is shown in FIG. 37G.
[00540] [00540] For each of the 14,289 splice acceptors in the test set, we extracted nucleosome data 50 nucleotides on each side and calculated its nucleosome enrichment as the average signal on the exon side divided by the average signal on the intron side. We ranked the splice acceptors in increasing order of nucleosome enrichment and calculated their scores using SpliceNet-80nt. The acceptor scores are plotted against nucleosome enrichment in FIG. 44B. Before plotting, the smoothing procedure used in FIG. 37F was applied. We repeated this analysis using SpliceNet-10k, and also for the 14,289 splice donors in the test set.
[00541] [00541] For FIG. 37H, we wanted to observe the nucleosome signal around predicted novel exons. To ensure that we were analyzing highly reliable novel exons, we selected only singleton variants (variants present in a single GTEx individual) for which the predicted gained junction was completely private to the individual with the variant. In addition, to remove confounding effects from nearby exons, we analyzed only intronic variants at least 750 nt away from annotated exons. We downloaded nucleosome signals for the GM12878 and K562 cell lines from the UCSC browser and extracted the nucleosome signal within 750 nt of each of the predicted novel acceptor or donor sites. We averaged the nucleosome signal between the two cell lines and flipped the signal vectors for variants overlapping genes on the negative strand. We shifted the signal from the acceptor sites 70 nt to the right and the signal from the donor sites 70 nt to the left. After the shift, the nucleosome signal at the acceptor and donor sites was centered in the middle of an idealized exon of length 140 nt, which is the average exon length in the GENCODE v19 annotations. Finally, we averaged all the shifted signals and smoothed the resulting signal by averaging within an 11 nt window centered at each position.
[00542] [00542] To test the significance of this association, we selected random SNVs that were at least 750 nt away from annotated exons and were predicted by the model to have no effect on splicing (Δ score < 0.01). We created 1,000 random samples of these SNVs, each sample having as many SNVs as the set of splice site gain sites used for FIG. 37H (128 sites). For each random sample, we calculated an average smoothed signal as described above. Since random SNVs were not expected to create new exons, we centered each SNV's nucleosome signal on the SNV itself and randomly shifted it 70 nt to the left or 70 nt to the right. We then compared the nucleosome signal at the middle base of FIG. 37H with the signals obtained at that base from the 1,000 simulations. An empirical p-value was calculated as the fraction of simulated sets that had an average value greater than or equal to that observed for the splice site gain variants.
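A minimal sketch of this permutation test (the signal extraction is abstracted into per-sample central-base means, which are assumptions of the sketch):

    import numpy as np

    def empirical_p_value(observed_mean, control_means, seed=0):
        # control_means: central-base mean signals for the candidate
        # control SNVs; draw 1,000 random samples of 128 and report the
        # fraction whose mean is >= the observed mean.
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(1000):
            sample = rng.choice(control_means, size=128, replace=False)
            if sample.mean() >= observed_mean:
                hits += 1
        return hits / 1000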
[00543] [00543] To investigate how well the network's predictions generalize, we evaluated SpliceNet-10k in regions of variable exon density. First, we separated the test set positions into 5 categories, depending on the number of canonical exons present in a window around the position.
[00544] [00544] Training multiple models and using the average of their predictions as the result is a common strategy in machine learning to obtain better predictive performance, known as ensemble learning. In FIG. 53A, we show the top-k accuracy and the area under the precision-recall curves of the 5 SpliceNet-10k models we trained to build the ensemble. The results clearly demonstrate the stability of the training process.
[00545] [00545] We also calculated the Pearson correlation between their predictions. Since most positions in the genome are not splice sites, the correlation between the predictions of most models would trivially be close to 1, making the analysis uninformative. To overcome this problem, we considered only the positions in the test set that were assigned an acceptor or donor score greater than or equal to 0.01 by at least one model. This criterion was met by 53,272 positions (approximately equal numbers of splice and non-splice sites). The results are summarized in FIG. 53B. The very high Pearson correlations between the models' predictions further illustrate their robustness.
[00546] [00546] We show the effect of the number of models used to build the ensemble on performance in FIG. 53C. The results show that performance improves as the number of models increases, with diminishing returns. II. Analyses in the GTEx RNA-seq data set. Scoring of a single nucleotide variant
[00547] [00547] We quantify the splicing change due to a single nucleotide variant as follows: first, we use the reference nucleotide and calculate the acceptor and donor scores at 101 positions around the variant (50 positions on each side). Suppose these scores are denoted by the vectors a_ref and d_ref, respectively. Then, we use the alternative nucleotide and recalculate the acceptor and donor scores. Let these scores be denoted by the vectors a_alt and d_alt, respectively. We evaluated the following four quantities:

Δ score (acceptor gain) = max( a_alt - a_ref )
Δ score (acceptor loss) = max( a_ref - a_alt )
Δ score (donor gain) = max( d_alt - d_ref )
Δ score (donor loss) = max( d_ref - d_alt )

The maximum of these four scores is referred to as the Δ score of the variant.
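A minimal sketch of the Δ score computation, assuming the four score vectors have already been produced by the model:

    import numpy as np

    def delta_score(a_ref, a_alt, d_ref, d_alt):
        # Each argument is a length-101 vector of scores around the
        # variant; the Δ score is the maximum of the four quantities.
        return max(np.max(a_alt - a_ref),   # acceptor gain
                   np.max(a_ref - a_alt),   # acceptor loss
                   np.max(d_alt - d_ref),   # donor gain
                   np.max(d_ref - d_alt))   # donor loss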
[00548] [00548] Criteria for quality control and filtering of variants. We downloaded the GTEx VCF and RNA-seq data from dbGaP (study accession phs000424.v6.p1; https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v6.p1).
[00549] [00549] We evaluated the performance of SpliceNet on autosomal SNVs that appeared in at most 4 individuals of the GTEx cohort.
[00550] [00550] For variants meeting these criteria in at least one individual, we considered all individuals in which the variant appeared (even if it did not meet the criteria above) as having the variant. We refer to variants appearing in a single individual as singleton variants and to variants appearing in 2-4 individuals as common variants. We did not evaluate variants appearing in 5 or more individuals, in order to avoid overlap with the training data set.
[00551] [00551] We used OLego (Wu et al., 2013) to map the reads from the GTEx samples against the hg19 reference, allowing a maximum edit distance of 4 between the query read and the reference (parameter -M 4). Note that OLego can operate completely de novo and does not require any gene annotation. Since OLego searches for the presence of splicing motifs at the ends of split reads, its alignments can be biased toward or against the reference around SNVs that disrupt or create splice sites, respectively.
[00552] [00552] We used the leafcutter_cluster utility of the leafcutter package (Li et al., 2018) to detect and count splice junctions in each sample. We required a single split read to support a junction and assumed a maximum intron length of 500 kb (parameters -m 1 -l 500000). To obtain a set of highly reliable junctions for training the deep learning model, we compiled the union of all leafcutter junctions across all samples and then filtered them.
[00553] [00553] Junctions present in 5 or more individuals were used to augment GENCODE's annotated splice junction list for the variant prediction analyses (FIGs. 38A-G, 39A-C, 40A-E and 41A-F). Links to the files containing the list of splice junctions used to train the model are provided in the Key Resources table.
[00554] [00554] Although we used junctions detected by leafcutter to augment the training data set, we noticed that, despite the use of relaxed parameters, leafcutter was filtering out many junctions with good support in the RNA-seq data. This artificially lowered our validation rates. Thus, for the GTEx RNA-seq validation analyses (FIGs. 38A-G and 39A-C), we recomputed the set of junctions and junction counts directly from the RNA-seq read data. We counted all non-duplicate split-mapped reads with MAPQ of at least 10 and at least 5 nt aligned on each side of the junction. A read was allowed to span more than two exons; in this case, the read was counted at each junction with at least 5 nt of sequence mapped on both sides.
[00555] [00555] A junction was considered private to individual A if it met at least one of the following criteria:
[00556] [00556] Tissues with fewer than 5 samples from other individuals (not A) were ignored for this test.
[00557] [00557] If a private junction had exactly one annotated end, based on the GENCODE annotations, we considered it a candidate acceptor or donor gain and searched for singleton SNVs (SNVs appearing in a single GTEx individual) that were private to the same individual within 150 nt of the unannotated end. If a private junction had both ends annotated, we considered it a candidate private exon-skipping event if it skipped at least one, but no more than 3, exons of the same gene, based on the GENCODE annotations. We then searched for singleton SNVs within 150 nt of the ends of each of the skipped exons. Private junctions with both ends missing from the GENCODE exon annotations were ignored, as a substantial fraction of them were alignment errors.
[00558] [00558] To calculate the enrichment of singleton SNVs around novel private acceptors or donors (FIG. 38B, bottom), we added up the counts of singleton SNVs at each position relative to the private junction. If the overlapping gene was on the negative strand, the relative positions were flipped. We divided the SNVs into two groups: SNVs that were private to the individual with the private junction and SNVs that were private to a different individual. To smooth the resulting signals, we averaged the counts within a 7 nt window centered at each position. We then calculated the ratio of the smoothed counts in the first group (private to the same individual) to the smoothed counts in the second group (private to a different individual). For novel private exon-skipping events (FIG. 38B, top), we followed a similar procedure, aggregating the singleton SNV counts around the ends of the skipped exons.
[00559] [00559] For private variants (appearing in one individual in the GTEx cohort) or common variants (appearing in two to four individuals in the GTEx cohort), we obtained the predictions of the deep learning model for the reference and alternative alleles and calculated the Δ score. We also obtained the location where the model predicted the aberrant junction (novel or disrupted). We then sought to determine whether there was evidence in the RNA-seq data supporting a splicing aberration at the predicted location in individuals with the variant. In many cases, the model may predict several effects for the same variant; for example, a variant that disrupts an annotated splice donor may also increase the use of a suboptimal donor, as in FIG. 45, in which case the model may predict a donor loss at the annotated splice site and a donor gain at the suboptimal site. However, for validation purposes, we only considered the effect with the highest predicted Δ score for each variant. Therefore, for each variant, we considered the predicted splice site-creating and splice site-disrupting effects separately. Note that junctions appearing in fewer than five individuals were excluded during model training, to avoid evaluating the model on junctions it was trained on.
[00560] [00560] For each private variant predicted to cause novel junction formation, we used the network to predict the position of the newly created aberrant splice junction and examined the RNA-seq data to validate whether that new junction appeared only in the individual with the SNV and in no other GTEx individual. Likewise, for a variant predicted to cause a splice site loss affecting a splice site of exon X, we looked for novel exon-skipping events, from the previous canonical exon (the one upstream of X based on the GENCODE annotations) to the next canonical exon (the one downstream of X), that appeared only in individuals with the variant and in no other GTEx individual. We excluded predicted losses if the splice site predicted by the model to be lost was not annotated in GENCODE or was never observed in GTEx individuals without the variant. We also excluded predicted gains if the splice site predicted to be gained was already annotated in GENCODE. To extend this analysis to common variants (present in two to four individuals), we also validated new junctions that were present in at least half of the individuals with the variant and absent in all individuals without the variant.
[00561] [00561] Using the requirement that the predicted aberrant splice event be private to the individuals with the variant, we could validate 40% of the predicted high-score acceptor and donor gains (Δ score > 0.5), but only 3.4% of the predicted high-score losses.
[00562] [00562] For a junction j in sample s, we obtained a normalized junction count c_js:

c_js = asinh( r_js / Σ_g r_gs )   (1)
[00563] [00563] Here, r_js is the raw junction count of junction j in sample s, and the sum in the denominator is taken over all other junctions g between annotated acceptors and donors of the same gene as j (using annotations from GENCODE v19). The asinh transformation is defined as asinh(x) = ln(x + √(x² + 1)). It is similar to the logarithmic transformation often used to transform RNA-seq data (Lonsdale et al., 2013); however, it is defined at 0, thus eliminating the need for pseudo-counts, which would have substantially skewed the values, since many junctions, especially novel ones, have very low counts.
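A minimal sketch of equation (1):

    import numpy as np

    def normalized_junction_count(r_js, gene_junction_counts):
        # c_js = asinh(r_js / sum of raw counts of the gene's annotated
        # junctions in the same sample), as in equation (1).
        return np.arcsinh(r_js / np.sum(gene_junction_counts))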
[00564] [00564] For each predicted gained or lost junction j caused by an SNV appearing in a set of individuals I, we calculated the following z-score in each tissue t separately:

z_j,t = ( mean_{s∈A_t}(c_js) - mean_{s'∈U_t}(c_js') ) / std_{s'∈U_t}(c_js')   (2)
[00565] [00565] where A_t is the set of samples from the individuals in I in tissue t and U_t is the set of samples from the other individuals in tissue t. Note that there may be multiple samples in the GTEx data set for the same individual and tissue. As before, c_js is the normalized count of junction j in sample s. For predicted losses, we also calculated a similar z-score for the junction k that skips the supposedly affected exon:

z_k,t = ( mean_{s'∈U_t}(c_ks') - mean_{s∈A_t}(c_ks) ) / std_{s'∈U_t}(c_ks')   (3)
[00566] [00566] Note that a loss resulting in exon skipping would lead to a relative decrease of the lost junction and a relative increase of the skipping junction. This justifies the reversal of the difference in the numerator of z_k,t relative to z_j,t; both scores would therefore tend to be negative for a real splice site loss.
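A minimal sketch of equations (2) and (3), using the normalized counts c_js:

    import numpy as np

    def junction_z(carrier_counts, control_counts):
        # Equation (2): negative when the junction is depleted in
        # variant carriers relative to controls.
        return ((np.mean(carrier_counts) - np.mean(control_counts))
                / np.std(control_counts))

    def skipping_z(carrier_counts, control_counts):
        # Equation (3): the difference is reversed so that increased
        # skipping in carriers also yields a negative score.
        return ((np.mean(control_counts) - np.mean(carrier_counts))
                / np.std(control_counts))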
[00567] [00567] Finally, we calculated the median z-score across all tissues considered. For losses, we calculated the median for each of the z-scores in equations (2) and (3) separately. An acceptor or donor loss prediction was considered valid if any of the following conditions were true:
[00568] [00568] A description of the permutations used to obtain the above cutoffs is provided in the section "Estimating false validation rates".
[00569] [00569] Empirically, we observed that we needed to apply stricter validation criteria for losses than for gains since, as explained in the section "Validation of predicted cryptic splice mutations based on private splice junctions", losses tend to produce more mixed effects than gains. Observing a new junction near a private SNV is very unlikely to occur by chance, so even modest evidence for the junction should be sufficient for validation. In contrast, most predicted losses resulted in the weakening of an existing junction, and such weakening is harder to detect than the on-off change caused by gains and is more likely to be attributable to noise in the RNA-seq data.
[00570] [00570] To avoid calculating z scores in the presence of low counts or low coverage, we use the following criteria to filter variants for the validation analysis:
[00571] [00571] Variants for which no tissue met the above criteria were considered non-determinable and were excluded from the calculation of the validation rate. For splice gain variants, we filtered out those occurring at existing splice sites annotated in GENCODE. Likewise, for splice loss variants, we considered only those that decreased the score of existing splice sites annotated in GENCODE. Overall, 55% and 44% of the predicted high-score gains and losses (Δ score > 0.5), respectively, were considered determinable and used for the validation analysis.
[00572] [00572] To ensure that the above procedure has reasonable true validation rates, we first examined SNVs that appear in 1-4 GTEx individuals and disrupt the essential GT/AG dinucleotides. We reasoned that these mutations almost certainly affect splicing, so their validation rate should be close to 100%. Among these disruptions, 39% were determinable based on the criteria described above and, among the determinable ones, the validation rate was 81%. To estimate the false validation rate, we shuffled the individual labels of the SNV data. For each SNV that appeared in k GTEx individuals, we chose a random subset of k GTEx individuals and assigned the SNV to them. We created 10 such random data sets and repeated the validation process on them. The validation rate in the shuffled data sets was 1.7-2.1% for gains and 4.3-6.9% for losses, with medians of 1.8% and 5.7%, respectively. The higher false validation rate for losses and the relatively low validation rate of essential disruptions are due to the difficulty of validating splice site losses, as highlighted in the section "Validation of predicted cryptic splice mutations based on private splice junctions".
[00573] [00573] We define the "effect size" of a variant as the fraction of transcripts of the affected gene that changed their splicing pattern due to the variant (for example, the fraction that switched to a new acceptor or donor). As a reference example for a predicted splice gain variant, consider the variant in FIG. 38C. For a predicted novel donor A, we first identify the junction (AC) to the nearest annotated acceptor C. We also identify a "reference" junction (BC), where B is the annotated donor closest to A. In each sample s, we calculated the relative usage of the new junction (AC) compared to the reference junction (BC):

u_(AC),s = r_(AC),s / ( r_(AC),s + r_(BC),s )   (4)
[00574] [00574] Here, r_(AC),s is the raw read count of junction (AC) in sample s. For each tissue, we calculated the change in the usage of junction (AC) between individuals with the variant and all other individuals:

mean_{s∈A_t}( u_(AC),s ) - mean_{s'∈U_t}( u_(AC),s' )   (5)
[00575] [00575] where A_t is the set of samples from individuals with the variant in tissue t and U_t is the set of samples from the other individuals in tissue t. The final effect size was calculated as the median of the above difference across all tissues considered. The calculation was similar in the case of a gained acceptor, or in the case where the splice site-creating variant was intronic. A simplified version of the effect size calculation (assuming a single sample each from individuals with and without the variant) is shown in FIG. 38C.
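A minimal sketch of equations (4) and (5) for a gained junction:

    import numpy as np

    def junction_usage(r_new, r_ref):
        # Equation (4): relative usage of the new junction (AC) versus
        # the reference junction (BC) in one sample.
        return r_new / (r_new + r_ref)

    def gain_effect_size(per_tissue_usages):
        # per_tissue_usages: list of (carrier_usages, control_usages)
        # pairs, one per tissue. Equation (5) per tissue, then the
        # median across tissues.
        diffs = [np.mean(carriers) - np.mean(controls)
                 for carriers, controls in per_tissue_usages]
        return np.median(diffs)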
[00576] [00576] For a predicted loss, we first calculate the fraction of transcripts that skipped the affected exon. The calculation is shown in FIG.
[00577] [00577] As for the gains, we calculated the change in the skipped fraction between samples from individuals with the variant and samples from individuals without the variant:

mean_{s∈A_t}( f_s ) - mean_{s'∈U_t}( f_s' )   (7)

where f_s denotes the fraction of transcripts skipping the affected exon in sample s, calculated as above.
[00578] [00578] The skipped fraction of transcripts, as calculated above, does not fully capture the effects of an acceptor or donor loss, as the disruption can also lead to increased levels of intron retention or the use of suboptimal splice sites. To account for some of these effects, we also calculated the usage of the lost junction (CE) relative to the usage of other junctions involving the same acceptor E:

e_(CE),s = r_(CE),s / Σ_g r_(gE),s   (8)
[00579] [00579] Here, Σ_g r_(gE),s is the sum over all junctions (annotated or novel) from any donor g to acceptor E. This includes the affected junction (CE), the skipping junction (AE), as well as potential junctions from other suboptimal donors that compensated for the loss of C, as illustrated in the example of FIG. 45. We then calculated the change in the relative usage of the affected junction:

mean_{s'∈U_t}( e_(CE),s' ) - mean_{s∈A_t}( e_(CE),s )   (9)
[00580] [00580] Note that, unlike (5) and (7), which measure the increase in the usage of the gained or skipping junction in individuals with the variant, in (9) we want to measure the decrease in the usage of the lost junction, hence the reversal of the two sides of the difference. For each tissue, the effect size was calculated as the maximum of (7) and (9). As for the gains, the final effect size of the variant was the median effect size across tissues.
[00581] [00581] A variant was considered for the effect size calculation only if it was considered validated based on the criteria described in the previous section. To avoid estimating the fraction of aberrant transcripts from very small numbers, we considered only samples where the counts of the aberrant and reference junctions were at least 10. Since most cryptic splice variants were intronic, the effect size could not be calculated directly by counting reads with the reference and alternative alleles overlapping the variant. Therefore, the effect size of losses is calculated indirectly from the decrease in the relative usage of the normal splice junction. For the effect size of novel junction gains, aberrant transcripts may be affected by nonsense-mediated decay, attenuating the observed effect sizes. Despite the limitations of these measurements, we observed a consistent trend toward smaller effect sizes for lower-scoring cryptic splice variants in both gain and loss events.
[00582] [00582] For a fully penetrant splice site-creating variant that causes all transcripts from the variant haplotype of individuals with the variant to switch to the new junction, and assuming that the new junction does not occur in control individuals, the expected effect size would be 0.5 by equation (5).
[00583] [00583] Likewise, if a heterozygous SNV causes a novel exon-skipping event and all transcripts from the affected haplotype switch to the skipping junction, the expected effect size in equation (7) is 0.5. If all transcripts from individuals with the variant switched to a different junction (the skipping junction or a compensating junction), the ratio in equation (8) would be 0.5 in samples from individuals with the variant and 1 in samples from other individuals, so the difference in equation (9) would be 0.5. This assumes that there were no skipping events or other junctions into acceptor E in individuals without the variant. It also assumes that the splice site disruption does not trigger intron retention. In practice, at least low levels of intron retention are often associated with splice site disruptions. In addition, exon skipping is widespread, even in the absence of splice-altering variants. This explains why the measured effect sizes are below 0.5, even for variants that disrupt essential GT/AG dinucleotides.
[00584] [00584] The expectation of a 0.5 effect size for fully penetrant heterozygous variants also assumes that the variant does not trigger nonsense-mediated decay (NMD). In the presence of NMD, both the numerator and the denominator of equations (4), (6) and (8) would decrease, thus reducing the observed effect size.
[00585] [00585] For FIG. 38C, since the variant was exonic, we could count the number of reads that covered the variant and carried the reference or alternative allele ("Ref (no splicing)" and "Alt (no splicing)", respectively). We also counted the number of reads that spliced at the new splice site and presumably carried the alternative allele ("Alt (new junction)"). In the example of FIG. 38C, as in many other cases we examined, we observed that the total number of reads from the haplotype with the alternative allele (the sum of "Alt (no splicing)" and "Alt (new junction)") was less than the number of reads with the reference allele ("Ref (no splicing)"). Since we believe we eliminated reference bias during read mapping, by mapping against both the reference and alternative haplotypes, and assuming that the number of reads is proportional to the number of transcripts with each allele, we expected the reference allele to account for half of the reads at the variant location. We assume that the missing alternative-allele reads correspond to transcripts from the alternative allele haplotype that spliced at the new junction and were degraded by nonsense-mediated decay (NMD). We call this group "Alt (NMD)".
[00586] [00586] To determine whether the difference between the observed numbers of reference and alternative reads was significant, we calculated the probability of observing Alt (no splicing) + Alt (new junction) reads (or fewer) under a binomial distribution with success probability 0.5 and a total number of trials of Alt (no splicing) + Alt (new junction) + Ref (no splicing). This is a conservative p-value, since we underestimate the total number of "trials" by not counting potentially degraded transcripts. The fraction of NMD transcripts in FIG. 38C was calculated as the number of "Alt (NMD)" reads over the total number of reads spliced at the new junction (Alt (NMD) + Alt (new junction)).
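A minimal sketch of this binomial test using SciPy:

    from scipy.stats import binom

    def nmd_binomial_p(alt_no_splice, alt_new_junction, ref_no_splice):
        # Probability of observing this many alternative-allele reads
        # or fewer under Binomial(n, 0.5); conservative, since degraded
        # transcripts are not counted among the trials.
        k = alt_no_splice + alt_new_junction
        n = k + ref_no_splice
        return binom.cdf(k, n, 0.5)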
[00587] [00587] To assess the sensitivity of the SpliceNet model (FIG. 38F), we used SNVs that were within 20 nt of the affected splice site (that is, the novel or disrupted acceptor or donor), did not overlap the essential GT/AG dinucleotide of an annotated exon, and had an estimated effect size of at least 0.3 (see the section "Calculating the effect size"). In all sensitivity graphs, SNVs were defined as "near exons" if they overlapped an annotated exon or were within 50 nt of the boundaries of an annotated exon. All other SNVs were considered "deep intronic". Using this strongly supported set of cryptic splice sites, we evaluated our model at varying Δ score thresholds and reported the fraction of the cryptic splice sites in the truth data set predicted by the model at that cutoff.
[00588] [00588] We performed a head-to-head comparison of SpliceNet-10k, MaxEntScan (Yeo and Burge, 2004), GeneSplicer (Pertea et al., 2001) and NNSplice (Reese et al., 1997) on various metrics. We downloaded the MaxEntScan and GeneSplicer software from http://genes.mit.edu/burgelab/maxent/download/ and http://www.cs.jhu.edu/~genomics/GeneSplicer/, respectively. NNSplice is not available as downloadable software; therefore, we downloaded the training and test sets from http://www.fruitfly.org/data/seq_tools/datasets/Human/GENIE_96/splicesets/ and trained models with the best-performing architectures described in (Reese et al., 1997). As a sanity check, we reproduced the test set metrics reported in (Reese et al., 1997). To assess the top-k accuracy and the area under the precision-recall curves of these algorithms, we scored all positions in the test set genes and in the lincRNAs with each algorithm (FIG. 37D).
[00589] [00589] The MaxEntScan and GeneSplicer outputs correspond to log odds ratios, while the NNSplice and SpliceNet-10k outputs correspond to probabilities. To give MaxEntScan and GeneSplicer the best chance of success, we calculated Δ scores using both their standard output and a transformed output, where we first transformed the outputs into probabilities. More precisely, the standard MaxEntScan output corresponds to

x = log2( p(splice site) / p(not a splice site) )
[00590] [00590] which, after the transformation p = 2^x / (1 + 2^x), corresponds to the desired quantity. We compiled the GeneSplicer software twice, once setting the RETURN_TRUE_PROB flag to 0 and once setting it to 1. We chose the output strategy that led to the best validation rate against the RNA-seq data (MaxEntScan: transformed output; GeneSplicer: standard output).
[00591] [00591] To compare the validation rate and the sensitivity of the various algorithms (FIG. 38G), we found cutoffs at which all algorithms predicted the same number of gains and losses genome-wide. That is, for each cutoff on the SpliceNet-10k Δ score values, we found the cutoffs at which each competing algorithm would make the same number of gain predictions and the same number of loss predictions as SpliceNet-10k. The chosen cutoffs are given in Table S2.
[00592] [00592] We performed the validation and sensitivity analyses (as described in the sections "Sensitivity analysis" and "Validation of model predictions") separately for singleton SNVs and SNVs appearing in 2-4 GTEx individuals (FIGs. 46A, 46B and 46C). To test whether the validation rate differed significantly between singleton and common variants, we performed a Fisher's exact test comparing the validation rates in each Δ score group (0.2-0.35, 0.35-0.5, 0.5-0.8, 0.8-1) and for each predicted effect (acceptor or donor gain or loss). After Bonferroni correction to account for 16 tests, all P values were greater than 0.05. Likewise, we compared the sensitivity for detecting singleton or common variants. We used Fisher's exact test to test whether the validation rate differed significantly between the two groups of variants. We considered deep intronic variants and variants near exons separately and performed Bonferroni correction for two tests. None of the P values were significant at a cutoff of 0.05. Therefore, we combined singleton and common GTEx variants and considered them together in the analyses presented in FIGs. 48A, 48B, 48C, 48D, 48E, 48F and 48G and FIGs. 39A, 39B and 39C.
[00593] [00593] We compared the RNA-seq validation rate and the sensitivity of SpliceNet-10k between variants on the chromosomes used during training and variants on the remaining chromosomes (FIGs. 48A and 48B). All P values were greater than 0.05 after Bonferroni correction. We also calculated the fraction of deleterious variants separately for variants on the training and test chromosomes, as described in the section "Fraction of deleterious variants" below (FIG. 48C). For each Δ score group and each variant type, we used Fisher's exact test to compare the numbers of common and rare variants between the training and test chromosomes. After Bonferroni correction for 12 tests, all P values were greater than 0.05. Finally, we calculated the number of de novo cryptic splice mutations on the training and test chromosomes (FIG. 48D), as described in the section "Enrichment of de novo mutations per cohort".
[00594] We divided the variants predicted to create splice sites into three groups: variants that create a new essential GT or AG splice dinucleotide, variants overlapping the rest of the splice motif (positions around the exon-intron boundary up to 3 nt into the exon and 8 nt into the intron) and variants outside the splice motif (FIGs. 47A and 47B). For each Δ score group (0.2 - 0.35, 0.35 - 0.5, 0.5 - 0.8, 0.8 - 1), we performed a χ² test of the hypothesis that the validation rate is uniform across the three types of splice site-creating variants. All tests produced P values > 0.3 even before correction for multiple hypotheses. To compare the effect size distributions between the three types of variants, we used the Mann-Whitney U test and compared the three pairs of variant types for each Δ score group (for a total of 4 × 3 = 12 tests). After Bonferroni correction for 12 tests, all P values were > 0.3.
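A sketch of the pairwise effect-size comparison, assuming the effect sizes have been collected per variant category within one Δ score bin (the input layout is hypothetical):

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

def compare_effect_sizes(effects_by_category, n_tests=12):
    """effects_by_category maps each of the three splice site-creating
    variant categories to its list of measured effect sizes.
    Returns Bonferroni-corrected p-values for the three pairs."""
    results = {}
    for a, b in combinations(effects_by_category, 2):
        _, p = mannwhitneyu(effects_by_category[a], effects_by_category[b],
                            alternative="two-sided")
        results[(a, b)] = min(1.0, p * n_tests)
    return results
```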
[00595] For FIG. 39C, we wanted to test whether the rate of use of new junctions was uniform among the tissues expressing the affected gene. We focused on SNVs that created new private splice sites, that is, SNVs resulting in an acquired splice junction that appeared in at least half of the individuals with the variant and in no other individual. For each new junction $j$, we calculated, in each tissue $t$, the total counts of the junction in all samples from individuals with the variant in that tissue, $\sum_{s \in A_t} x_{j,s}$, where $A_t$ is the set of samples from individuals with the variant in tissue $t$. Likewise, we calculated the total counts of all annotated junctions of the gene for the same samples, $\sum_{s \in A_t} \sum_g x_{g,s}$, where $g$ indexes the annotated junctions of the gene. The relative use of the new junction in tissue $t$, normalized against background gene counts, can then be measured as:

$$r_t = \frac{\sum_{s \in A_t} x_{j,s}}{\sum_{s \in A_t} \left( x_{j,s} + \sum_g x_{g,s} \right)}$$
[00596] We also calculated the average use of the junction across tissues:

$$\bar{r} = \frac{\sum_t \sum_{s \in A_t} x_{j,s}}{\sum_t \sum_{s \in A_t} \left( x_{j,s} + \sum_g x_{g,s} \right)}$$
[00597] We wanted to test the hypothesis that the relative use of the junction is uniform across tissues and equal to $\bar{r}$. We therefore performed a χ² test comparing the observed tissue counts $\sum_{s \in A_t} x_{j,s}$ with the expected counts under the hypothesis of a uniform rate, $\bar{r} \sum_{s \in A_t} \left( x_{j,s} + \sum_g x_{g,s} \right)$. A splice site-creating variant was considered tissue-specific if the Bonferroni-corrected p-value was less than 10⁻². The degrees of freedom for the test are $T - 1$, where $T$ is the number of tissues considered.
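A sketch of this test, with the per-tissue count arrays as hypothetical inputs:

```python
import numpy as np
from scipy.stats import chi2

def tissue_specificity_test(junction_counts, background_counts):
    """Chi-square test that a novel junction is used at a uniform rate
    across tissues. junction_counts[t] and background_counts[t] are the
    junction and annotated-gene-junction counts summed over carrier
    samples in tissue t."""
    junction_counts = np.asarray(junction_counts, dtype=float)
    totals = junction_counts + np.asarray(background_counts, dtype=float)
    r_bar = junction_counts.sum() / totals.sum()  # mean usage across tissues
    expected = r_bar * totals                     # expected counts under H0
    stat = np.sum((junction_counts - expected) ** 2 / expected)
    dof = len(junction_counts) - 1                # T - 1 degrees of freedom
    return stat, chi2.sf(stat, dof)
```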
[00598] We downloaded the Sites VCF version 0.3 file (60,706 exomes) from the ExAC browser (Lek et al., 2016) and the Sites VCF version 2.0.1 file (15,496 whole genomes) from the gnomAD browser. We created a filtered list of variants for evaluating SpliceNet-10k. In particular, only variants meeting the following criteria were considered: • the FILTER field was PASS.
[00599] A total of 7,615,051 and 73,099,995 variants passed these filters in the ExAC and gnomAD data sets, respectively.
[00600] For this analysis, only the variants in the filtered ExAC and gnomAD lists that were singleton or common (AF > 0.1%) were considered.
[00601] For each variant, we calculated its Δ score for the four splice types using SpliceNet-10k. Then, for each splice type, we constructed a 2 × 2 chi-square contingency table in which the two rows corresponded to predicted splice-altering variants (Δ score in the appropriate range for the splice type) versus predicted non-splice-altering variants (Δ score < 0.1 for all splice types) and the two columns corresponded to singleton versus common variants. For splice gain variants, we filtered out those occurring at existing splice sites annotated in GENCODE. Likewise, for splice loss variants, we considered only those that decrease the score of existing splice sites annotated in GENCODE. The odds ratio was calculated and the fraction of deleterious variants was estimated as $1 - \frac{1}{\text{odds ratio}}$.
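The estimate rests on the depletion of predicted splice-altering variants among common variants relative to singletons. A sketch, assuming the counts have been tallied into the 2 × 2 layout described above (the $1 - 1/\text{odds ratio}$ formula is the one reconstructed above):

```python
from scipy.stats import chi2_contingency

def fraction_deleterious(table):
    """table: [[splice_altering_singleton, splice_altering_common],
               [non_altering_singleton,    non_altering_common]].
    Returns the odds ratio, the estimated fraction of predicted
    splice-altering variants that are deleterious, and the p-value."""
    (a, b), (c, d) = table
    odds_ratio = (a * d) / (b * c)
    fraction = 1.0 - 1.0 / odds_ratio   # = (OR - 1) / OR
    _, p, _, _ = chi2_contingency(table)
    return odds_ratio, fraction, p
```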
[00602] The 2 × 2 chi-square contingency table for protein-truncating variants was constructed for the filtered ExAC and gnomAD lists and used to estimate the fraction of deleterious variants. Here, the two rows corresponded to protein-truncating versus synonymous variants and the two columns corresponded to common versus singleton variants, as before.
[00603] The results for the ExAC (exonic and near-intronic) and gnomAD (deep intronic) variants are shown in FIGs. 40B and 40D, respectively.
[00604] For this analysis, we focused on ExAC variants that were exonic (synonymous only) or near-intronic, and that were singleton or common (AF > 0.1%) in the cohort. To classify an acceptor gain variant as in-frame or frameshift, we measured the distance between the canonical splice acceptor and the newly created splice acceptor and checked whether it was a multiple of 3. We classified donor gain variants in the same way, measuring the distance between the canonical splice donor and the newly created splice donor.
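The frame classification reduces to a modulo-3 check on the distance between the two sites; a minimal sketch:

```python
def classify_splice_gain(canonical_pos, novel_pos):
    """Classify a predicted splice gain variant as in-frame or frameshift
    from the distance between the canonical site and the new site."""
    return "in-frame" if abs(novel_pos - canonical_pos) % 3 == 0 else "frameshift"

# A novel acceptor 12 nt from the canonical acceptor preserves the frame.
print(classify_splice_gain(1000, 1012))  # in-frame
print(classify_splice_gain(1000, 1013))  # frameshift
```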
[00605] The fraction of in-frame splice gain variants that are deleterious was estimated from a 2 × 2 chi-square contingency table in which the two rows corresponded to predicted in-frame splice gain variants (Δ score > 0.8 for acceptor or donor gain) versus predicted non-splice-altering variants (Δ score < 0.1 for all splice types) and the two columns corresponded to singleton versus common variants. This procedure was repeated for frameshift splice gain variants, replacing the first row of the contingency table with predicted frameshift splice gain variants.
[00606] To calculate the p-value shown in FIG. 40C, we built a 2 × 2 chi-square contingency table using only the predicted splice gain variants. Here, the two rows corresponded to in-frame versus frameshift splice gain variants and the two columns corresponded to singleton versus common variants, as before.
[00607] To estimate the number of rare functional cryptic splice variants per individual (FIG. 40E), we first simulated 100 gnomAD individuals by including each gnomAD variant in each allele with probability equal to its allele frequency. In other words, each variant was sampled twice independently for each individual to mimic diploidy. We counted the number of rare (AF < 0.1%) exonic (synonymous only), near-intronic and deep intronic variants per person with a Δ score greater than or equal to 0.2, 0.2 and 0.5, respectively. These are relatively permissive score cutoffs that optimize sensitivity while ensuring that at least 40% of the predicted variants are deleterious. At these cutoffs, we obtained an average of 7.92 rare synonymous/near-intronic cryptic splice variants and 3.03 deep intronic cryptic splice variants per person. Since not all of these variants are functional, we multiplied the counts by the fraction of variants that are deleterious at these cutoffs.
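The diploid sampling step can be written compactly as a binomial draw of two trials per variant; a minimal sketch with a hypothetical allele-frequency vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_individuals(allele_frequencies, n_individuals=100):
    """Simulate diploid genotypes: each variant is drawn twice per
    individual, independently, with probability equal to its allele
    frequency. Returns a (n_individuals, n_variants) allele-count matrix."""
    af = np.asarray(allele_frequencies)
    return rng.binomial(2, af, size=(n_individuals, af.size))

# Example: mean carried copies of three variants across 100 individuals.
print(simulate_individuals([0.001, 0.01, 0.3]).mean(axis=0))
```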
[00608] We obtained published de novo mutations (DNMs). These included 3,953 probands with autism spectrum disorder (Dong et al., 2014; Iossifov et al., 2014; De Rubeis et al., 2014), 4,293 probands from the Deciphering Developmental Disorders (DDD) cohort (McRae et al., 2017) and 2,073 healthy controls (Iossifov et al., 2014). Low-quality DNMs were excluded from the analysis (ASD and healthy controls: Confidence == lowConf; DDD: PP(DNM) < 0.00781 (McRae et al., 2017)). DNMs were evaluated with the network and we used Δ scores (see methods above) to classify cryptic splice mutations, depending on the context. We considered only mutations annotated with VEP consequences of synonymous variant, splice region variant, intron variant, 5 prime UTR variant, 3 prime UTR variant or missense variant. We used sites with Δ score > 0.1 for FIGs. 41A, 41B, 41C, 41D, 41E, and 41F and FIGs. 50A and 50B, and sites with Δ score > 0.2 for FIGs. 49A, 49B and 49C.
[00609] FIGs. 20, 21, 22, 23, and 24 show a detailed description of the SpliceNet-80nt, SpliceNet-400nt, SpliceNet-2k and SpliceNet-10k architectures. The four architectures use flanking nucleotide sequences of lengths 40, 200, 1,000 and 5,000, respectively, on each side of the position of interest as input and generate the probabilities of the position being a splice acceptor, a splice donor or neither. The architectures mainly consist of convolutional layers conv(N, W, D), where N, W and D are, respectively, the number of convolution kernels, the window size and the dilation (atrous convolution) rate of each convolution kernel in the layer.
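For concreteness, a residual unit of this type (two batch normalization layers, two ReLU layers and two atrous convolution layers with a residual connection, as described elsewhere in this document) can be sketched in PyTorch; this is an illustrative sketch under those stated assumptions, not the published implementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """SpliceNet-style residual block: (BN -> ReLU -> dilated conv) x 2,
    plus a residual connection. N filters, window W, dilation rate D."""
    def __init__(self, n_filters, window, dilation):
        super().__init__()
        pad = dilation * (window - 1) // 2  # keeps sequence length fixed
        self.body = nn.Sequential(
            nn.BatchNorm1d(n_filters), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, window, dilation=dilation, padding=pad),
            nn.BatchNorm1d(n_filters), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, window, dilation=dilation, padding=pad),
        )

    def forward(self, x):
        return x + self.body(x)

# A one-hot input of shape (batch, 4, length) would first be lifted from
# 4 channels to 32 by an initial convolution before these blocks.
x = torch.randn(1, 32, 5000)
print(ResidualBlock(32, 11, 4)(x).shape)  # torch.Size([1, 32, 5000])
```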
[00610] FIGs. 42A and 42B depict the evaluation of several splicing prediction algorithms on lincRNAs. FIG. 42A shows the top-k accuracy and the area under the precision-recall curves of various splicing prediction algorithms when evaluated on lincRNAs. FIG. 42B shows the full pre-mRNA transcript of the LINC00467 gene scored using MaxEntScan and SpliceNet-10k, together with the predicted acceptor sites (red arrows) and donor sites (green arrows) and the actual exon positions.
[00611] FIGs. 43A and 43B illustrate position-dependent effects of the TACTAAC branch point and GAAGAA exonic splice enhancer motifs. With reference to FIG. 43A, the optimal TACTAAC branch point sequence was introduced at various distances from each of the 14,289 splice acceptors in the test set.
[00612] With regard to FIG. 43B, the GAAGAA SR-protein hexamer motif was similarly introduced at various distances from each of the 14,289 splice acceptors and donors in the test set. The average change in the predicted SpliceNet-10k acceptor and donor scores is plotted as a function of the distance from the splice acceptor and donor, respectively. The predicted scores increase when the motif is on the exonic side and less than ~50 nt from the splice site. At greater distances into the exon, the GAAGAA motif tends to disfavor the use of the splice acceptor or donor in question, presumably because it then preferentially supports a more proximal acceptor or donor motif. The very low acceptor and donor scores when GAAGAA is placed at positions very close to the intron are due to disruption of the extended acceptor or donor splice motifs.
[00613] FIGs. 44A and 44B depict effects of nucleosome positioning on splicing. With reference to FIG. 44A, at 1 million randomly chosen intergenic positions, strong acceptor and donor motifs spaced 150 nt apart were introduced and the probability of exon inclusion was calculated using SpliceNet-10k. To show that the correlation between the SpliceNet-10k predictions and nucleosome positioning occurs regardless of GC composition, the positions were binned based on GC content (calculated using the 150 nucleotides between the introduced splice sites) and the correlation between the SpliceNet-10k predictions and the nucleosome signal is plotted for each bin.
[00614] With regard to FIG. 44B, the splice acceptor and donor sites of the test set were scored using SpliceNet-80nt (referred to as the local motif score) and SpliceNet-10k, and the scores are plotted against nucleosome enrichment. Nucleosome enrichment is calculated as the nucleosome signal averaged over 50 nt on the exonic side of the splice site divided by the nucleosome signal averaged over 50 nt on the intronic side of the splice site. The SpliceNet-80nt score, which is a proxy for motif strength, is negatively correlated with nucleosome enrichment, while the SpliceNet-10k score is positively correlated with nucleosome enrichment. This suggests that nucleosome positioning is a long-range specificity determinant that can compensate for weak local splice motifs.
[00615] FIG. 45 illustrates an example of the effect size calculation for a splice-disrupting variant with complex effects. The intronic variant chr9:386429 A>G disrupts the normal donor site (C) and activates a previously suppressed intronic downstream donor (D). The RNA-seq coverage and junction read counts in whole blood of the individual with the variant and of a control individual are shown. The donor sites in the individual with the variant and in the control individual are marked with blue and gray arrows, respectively. Bold red letters correspond to the junction endpoints. For visibility, the lengths of the exons were exaggerated four-fold relative to the lengths of the introns. To estimate the effect size, we calculated the increase in use of the exon-skipping junction (AE) and the decrease in use of the disrupted junction (CE), compared to all other junctions sharing splice site E. The final effect size is the maximum of the two values (0.39). An increased amount of intron retention is also present in the mutated sample. Such variable effects are common in exon-skipping events and increase the complexity of validating rare variants that cause acceptor or donor site losses.
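A sketch of that calculation, with junction read counts as hypothetical dictionaries keyed by junction name; fractional usage is computed within the set of junctions sharing splice site E, for the carrier and for a control:

```python
def effect_size(variant_counts, control_counts, disrupted="CE", skipping="AE"):
    """Effect size for a splice-disrupting variant, as in FIG. 45: the
    larger of the gain in exon-skipping junction usage and the loss in
    disrupted-junction usage, carrier versus control."""
    def usage(counts, junction):
        return counts[junction] / sum(counts.values())

    skip_gain = usage(variant_counts, skipping) - usage(control_counts, skipping)
    loss = usage(control_counts, disrupted) - usage(variant_counts, disrupted)
    return max(skip_gain, loss)

# Illustrative counts only (not the read counts from FIG. 45):
print(effect_size({"AE": 40, "CE": 30, "DE": 30}, {"AE": 5, "CE": 90, "DE": 5}))
```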
[00616] FIGs. 46A, 46B, and 46C show the evaluation of the SpliceNet-10k model on singleton and common variants. With reference to FIG. 46A, fraction of cryptic splice mutations predicted by SpliceNet-10k that were validated against GTEx RNA-seq data. The model was evaluated on all variants appearing in at most four individuals of the GTEx cohort. Variants with predicted splice-altering effects were validated against RNA-seq data. The validation rate is shown separately for variants that appear in a single GTEx individual (left) and variants that appear in two to four GTEx individuals (right). The predictions are grouped by their Δ score. We compared the validation rate between singleton and common variants for each of the four classes of variants (acceptor or donor gain or loss) in each Δ score group. The differences are not significant (P > 0.05, Fisher's exact test with Bonferroni correction for 16 tests).
[00617] With regard to FIG. 46B, sensitivity of SpliceNet-10k in detecting splice-altering variants in the GTEx cohort at different Δ score cutoffs. The sensitivity of the model is shown separately for singleton (left) and common (right) variants. The differences in sensitivity between singleton and common variants at a Δ score cutoff of 0.2 are not significant for variants close to exons or for deep intronic variants (P > 0.05, Fisher's exact test with Bonferroni correction for two tests).
[00618] With regard to FIG. 46C, distribution of Δ score values for validated singleton and common variants. The p-values are from Mann-Whitney U tests comparing the scores of singleton and common variants. Common variants have significantly weaker Δ scores, owing to natural selection filtering out splice-disrupting mutations with large effects.
[00619] FIGs. 47A and 47B depict the validation rate and effect sizes of splice site-creating variants, split by the location of the variant. The predicted splice site-creating variants were grouped based on whether the variant created a new essential GT or AG splice dinucleotide, overlapped the rest of the splice motif (all positions around the exon-intron boundary up to 3 nt into the exon and 8 nt into the intron, excluding the essential dinucleotide) or fell outside the splice motif.
[00620] With regard to FIG. 47A, validation rate for each of the three categories of splice site-creating variants. The total number of variants in each category is shown above the bars. Within each Δ score group, the differences in validation rates between the three groups of variants are not significant (P > 0.3, χ² test of uniformity).
[00621] With regard to FIG. 47B, effect size distribution for each of the three categories of splice site-creating variants. Within each Δ score group, the differences in effect sizes between the three groups of variants are not significant (P > 0.3, Mann-Whitney U test with Bonferroni correction).
[00622] FIGs. 48A, 48B, 48C, and 48D depict the evaluation of the SpliceNet-10k model on training and test chromosomes. With reference to FIG. 48A, fraction of cryptic splice mutations predicted by the SpliceNet-10k model that were validated against GTEx RNA-seq data. The validation rate is shown separately for variants on the chromosomes used during training (all chromosomes except chr1, chr3, chr5, chr7 and chr9; left) and on the rest of the chromosomes (right). Predictions are grouped by their Δ score. We compared the validation rate between training and test chromosomes for each of the four classes of variants (acceptor or donor gain or loss) in each Δ score group. This accounts for possible differences in the distribution of predicted Δ score values between the training and test chromosomes. The differences in validation rates are not significant (P > 0.05, Fisher's exact test with Bonferroni correction for 16 tests).
[00623] With regard to FIG. 48B, sensitivity of SpliceNet-10k in detecting splice-altering variants in the GTEx cohort at different Δ score cutoffs. The sensitivity of the model is shown separately for variants on the chromosomes used for training (left) and on the rest of the chromosomes (right). We used Fisher's exact test to compare the sensitivity of the model at a Δ score cutoff of 0.2 between the training and test chromosomes. The differences are not significant for variants close to exons or for deep intronic variants (P > 0.05 after Bonferroni correction for two tests).
[00624] With regard to FIG. 48C, fraction of predicted synonymous and intronic cryptic splice variants in the ExAC data set that are deleterious, calculated separately for variants on the chromosomes used for training (left) and on the rest of the chromosomes (right). Fractions and P values are calculated as described in the section "Fraction of deleterious variants".
[00625] With regard to FIG. 48D, de novo cryptic splice mutations (DNMs) per person for the DDD, ASD and control cohorts, shown separately for variants on the chromosomes used for training (left) and on the rest of the chromosomes (right). The error bars show 95% confidence intervals (CI). The number of de novo cryptic splice variants per person is lower for the test set, as it is approximately half the size of the training set. The numbers are noisy due to the small sample size.
[00627] FIGs. 49A, 49B, and 49C illustrate de novo cryptic splice mutations in patients with rare genetic disease, using only sites in synonymous, intronic or untranslated regions. With reference to FIG. 49A, predicted de novo cryptic splice mutations (DNMs) with Δ score > 0.2 per person for patients in the Deciphering Developmental Disorders (DDD) cohort, individuals with autism spectrum disorders (ASD) from the Simons Simplex Collection and the Autism Sequencing Consortium, as well as healthy controls. The enrichment in the DDD and ASD cohorts above healthy controls is shown, adjusting for variant ascertainment between the cohorts. The error bars show 95% confidence intervals.
[00628] With regard to FIG. 49B, estimated proportion of pathogenic DNMs by functional category for the DDD and ASD cohorts, based on the enrichment of each category compared to healthy controls. The cryptic splice proportion is adjusted for the unascertained missense and deep intronic sites.
[00629] With regard to FIG. 49C, enrichment and excess of cryptic splice DNMs in the DDD and ASD cohorts compared to healthy controls at different Δ score cutoffs. The cryptic splice excess is adjusted for the unascertained missense and deep intronic sites.
[00630] FIGs. 50A and 50B depict de novo cryptic splice mutations in ASD and their share of pathogenic DNMs. With reference to FIG. 50A, enrichment and excess of cryptic splice DNMs within ASD probands at different Δ score cutoffs for predicting cryptic splice sites.
[00631] With regard to FIG. 50B, proportion of pathogenic DNMs attributable to cryptic splice sites as a fraction of all classes of pathogenic DNMs (including protein-coding mutations), using different Δ score cutoffs to predict cryptic splice sites. More permissive Δ score cutoffs increase the number of cryptic splice sites identified above the background expectation, at the cost of a lower odds ratio.
[00632] FIG. 51 depicts the RNA-seq validation of predicted de novo cryptic splice mutations in patients with ASD. RNA expression coverage and splice junction counts from 36 predicted cryptic splice sites selected for experimental validation by RNA-seq. For each sample, the RNA-seq coverage and junction counts for the affected individual are shown at the top and those for a control individual without the mutation are shown at the bottom. The graphs are grouped by validation status and type of splice aberration.
[00633] FIGs. 52A and 52B illustrate the RNA-seq validation rate and sensitivity of a model trained only on canonical transcripts. With reference to FIG. 52A, we trained the SpliceNet-10k model using only junctions from the GENCODE canonical transcripts and compared the performance of this model with that of a model trained on the canonical junctions plus splice junctions appearing in at least five individuals in the GTEx cohort. We compared the validation rates of the two models for each of the four classes of variants (acceptor or donor gain or loss) in each Δ score group. The differences in validation rates between the two models are not significant (P > 0.05, Fisher's exact test with Bonferroni correction for 16 tests).
[00634] With regard to FIG. 52B, sensitivity of the model trained only on canonical junctions in detecting splice-altering variants in the GTEx cohort at different Δ score cutoffs. The sensitivity of this model in deep intronic regions is lower than that of the model trained with GTEx junctions (P < 0.001, Fisher's exact test with Bonferroni correction). The sensitivity close to exons is not significantly different.
[00635] FIGs. 53A, 53B, and 53C illustrate that ensemble modeling improves the performance of SpliceNet-10k. With reference to FIG. 53A, the top-k accuracy and the area under the precision-recall curves of the 5 individual SpliceNet-10k models are shown. The models have the same architecture and were trained using the same data set. However, they differ due to the various random aspects involved in the training process, such as parameter initialization, data shuffling, etc.
[00636] With regard to FIG. 53B, the predictions of the 5 individual SpliceNet-10k models are highly correlated. For this study, we considered only the positions in the test set that were assigned an acceptor or donor score greater than or equal to 0.01 by at least one model. Subplot (i, j) is constructed by plotting the predictions of Model #i against those of Model #j (the corresponding Pearson correlation is displayed above the subplot).
[00637] With regard to FIG. 53C, performance improves as the number of models used to build the SpliceNet-10k ensemble is increased from 1 to 5.
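Ensembling here is plain averaging of the replicate models' per-position class probabilities; a minimal sketch, with the model outputs as hypothetical arrays (the pairwise correlations of FIG. 53B follow from the same flattened predictions):

```python
import numpy as np

def ensemble_scores(model_outputs):
    """Average per-position splice scores over replicate models.
    model_outputs: n_models x n_positions x 3 (acceptor, donor, neither)."""
    return np.mean(np.asarray(model_outputs), axis=0)

def pairwise_pearson(model_outputs):
    """Pearson correlation matrix between the flattened predictions of
    each pair of replicate models, as plotted in FIG. 53B."""
    flat = np.asarray(model_outputs).reshape(len(model_outputs), -1)
    return np.corrcoef(flat)
```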
[00638] FIGs. 54A and 54B depict the evaluation of SpliceNet-10k in regions of variable exon density. With reference to FIG. 54A, the test set positions were categorized into 5 bins, depending on the number of canonical exons present in the surrounding region.
[00639] With regard to FIG. 54B, we repeated the analysis with MaxEntScan for comparison. Note that the performance of both models improves with higher exon density, as measured by the top-k accuracy and the precision-recall AUC, because the number of positive test cases increases relative to the number of negative test cases.
[00640] Candidate cryptic splice DNMs were counted in each of the three cohorts. The DDD cohort did not report intronic DNMs > 8 nt from exons; therefore, regions > 8 nt from exons were excluded from all cohorts for the purposes of the enrichment analysis, to allow an equivalent comparison between the DDD and ASD cohorts (FIG. 41A). We also performed a separate analysis that excluded mutations with dual consequences on cryptic splicing and on protein-coding function, to demonstrate that the enrichment is not due to an enrichment of mutations with protein-coding effects in the affected cohorts (FIGs. 49A, 49B and 49C). The counts were scaled to account for differential DNM ascertainment between the cohorts by normalizing the rate of synonymous DNMs per individual between the cohorts, using the healthy control cohort as a reference. We compared the rate of cryptic splice DNMs per cohort using an E-test for comparing two Poisson rates (Krishnamoorthy and Thomson, 2004).
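The cited E-test has no standard SciPy implementation; as a sketch, the exact conditional test below is a common stand-in for comparing two Poisson rates. The counts in the example are hypothetical normalized values in the spirit of Table S3; the cohort sizes are those stated above:

```python
from scipy.stats import binomtest

def compare_poisson_rates(count_case, n_case, count_control, n_control):
    """Exact conditional test for two Poisson rates: given the total,
    the case count is Binomial(total, n_case / (n_case + n_control))
    under the null of equal per-individual rates."""
    total = count_case + count_control
    p_null = n_case / (n_case + n_control)
    return binomtest(count_case, total, p_null, alternative="greater").pvalue

# DDD (4,293 probands) versus healthy controls (2,073 individuals):
print(compare_poisson_rates(299, 4293, 98, 2073))
```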
[00641] The rates plotted for enrichment above expectation (FIG. 41C) were adjusted for the missing DNMs > 8 nt from exons by scaling up by the proportion of all cryptic splice DNMs expected to occur 9-50 nt from exons under a trinucleotide sequence context model (see "Enrichment of de novo mutations per gene" below). The silent-site-only proportion of diagnoses and the excess of cryptic splice sites (FIGs. 49B and 49C) were also adjusted for the unascertained missense sites, scaling the cryptic splice count by the proportion of cryptic splice sites expected to occur at missense versus synonymous sites. The impact of the Δ score cutoff on enrichment was assessed by calculating the enrichment of cryptic splice DNMs within the DDD cohort across a range of cutoffs. For each cutoff, the observed-to-expected odds ratio was calculated, together with the excess of cryptic splice DNMs.
[00642] The excess of DNMs over the baseline mutation rates can be regarded as the pathogenic yield in a cohort. We estimated the excess of DNMs by functional type in the ASD and DDD cohorts, relative to the healthy control cohort (FIG. 41B). DNM counts were normalized to the rate of synonymous DNMs per individual, as described above. The DDD cryptic splice count was adjusted for the missing DNMs 9-50 nt from exons, as described above. For the ASD and DDD cohorts, we also adjusted for the missing ascertainment of deep intronic variants > 50 nt from exons, using the proportion of near-intronic (< 50 nt) versus deep intronic (> 50 nt) cryptic splice variants under negative selection (FIG. 38G).
[00643] We determined null mutation rates for each possible variant in the genome using a trinucleotide sequence context model (Samocha et al., 2014). We used the network to predict the Δ score for all possible single nucleotide substitutions within exons and up to 8 nt into introns. Based on the null mutation rate model, we obtained the expected number of de novo cryptic splice mutations per gene (using Δ score > 0.2 as the cutoff).
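A sketch of the per-gene expectation, assuming a precomputed table of (gene, trinucleotide-context mutation rate, Δ score) over all candidate substitutions; the table layout and the factor of 2 for the two parental haplotypes are assumptions of this sketch:

```python
from collections import defaultdict

def expected_cryptic_splice_dnms(site_table, n_probands, cutoff=0.2):
    """Expected de novo cryptic splice DNM count per gene in a cohort.
    site_table: iterable of (gene, mu, delta_score) over all possible
    single nucleotide substitutions in exons and up to 8 nt into introns,
    where mu is the per-haplotype, per-generation mutation rate."""
    expected = defaultdict(float)
    for gene, mu, delta in site_table:
        if delta > cutoff:
            expected[gene] += 2 * mu * n_probands  # two haplotypes per proband
    return expected
```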
[00644] Following the DDD study (McRae et al., 2017), genes were evaluated for DNM enrichment relative to chance under two models, one considering only protein-truncating DNMs (PTVs) and one considering all protein-altering DNMs (PTVs, missense and in-frame indels). For each gene, we selected the more significant model and adjusted the P-value for multiple hypothesis testing. These tests were performed once without counting cryptic splice DNMs or cryptic splice mutation rates (the standard test used in the original DDD study) and once also counting cryptic splice DNMs and their mutation rates. We report as additional candidate genes those identified with an FDR-adjusted P-value < 0.01 when including cryptic splice DNMs, but an FDR-adjusted P-value > 0.01 when not including cryptic splice DNMs (the standard test). Enrichment tests were performed similarly for the ASD cohort.
[00645] We selected high-confidence de novo mutations from affected probands in the Simons Simplex Collection, in genes with RNA-seq expression of at least 1 RPKM in lymphoblastoid cell lines. We selected de novo cryptic splice variants for validation based on a Δ score cutoff of > 0.1 for splice loss variants and > 0.5 for splice gain variants. Since cell lines needed to be acquired well in advance, these cutoffs reflect an earlier iteration of our methods, compared to the cutoffs adopted elsewhere in this document (FIG. 38G and FIGs. 41A, 41B, 41C and 41D), and the network did not include GTEx splice junctions during training.
[00646] Lymphoblastoid cell lines were obtained from the SSC for these probands. The cells were cultured in growth medium (RPMI 1640, 2 mM L-glutamine, 15% fetal bovine serum) to a maximum cell density of 1 × 10⁶ cells/ml. When the cells reached maximum density, they were passaged by dissociating the cells by pipetting up and down 4 or 5 times and seeding to a density of 200,000 to 500,000 viable cells/ml. The cells were cultured at 37 ºC, 5% CO₂ for 10 days. Approximately 5 × 10⁵ cells were then harvested and centrifuged at 300 × g for 5 minutes at 4 ºC. RNA was extracted using the RNeasy Plus Micro Kit (QIAGEN) following the manufacturer's protocol. RNA quality was assessed using the Agilent RNA 6000 Nano Kit (Agilent Technologies) run on the Bioanalyzer 2100 (Agilent Technologies). The RNA-seq libraries were generated with the TruSeq Stranded Total RNA Library Prep Kit with Ribo-Zero Gold Set A (Illumina). The libraries were sequenced on HiSeq 4000 instruments at the Center for Advanced Technology (UCSF), using 150 nt single-read sequencing, at a depth of 270 to 388 million reads (average 358 million reads).
[00647] The sequencing reads for each patient were aligned with OLego (Wu et al., 2013) against a reference created from hg19 by replacing the patient's de novo variants (Iossifov et al., 2014) with the corresponding alternative allele. Sequencing coverage, splice junction usage and transcription sites were plotted as MISO sashimi plots (Katz et al., 2010). We evaluated the predicted cryptic splice sites as described above in the model prediction validation section. Thirteen novel splice sites (9 novel junctions, 4 exon-skipping events) were confirmed, as they were observed only in the sample carrying the cryptic splice site and in none of the 149 GTEx samples or the other 35 sequenced samples. For 4 additional exon-skipping events, low levels of exon skipping were frequently observed in GTEx. In these cases, we calculated the fraction of reads using the skipping junction and found that this fraction was higher in the sample containing the cryptic splice site than in the other samples. Four additional cases were validated based on prominent intron retention that was absent or much lower in the other samples. Modest intron retention in control samples prevented us from resolving the events in DDX11 and WDR4. Two events (in CSAD and GSAP) were classified as validation failures because the variant was not present in the sequencing reads.
AVAILABILITY OF DATA AND SOFTWARE
[00648] Training and test data, prediction scores for all single nucleotide substitutions in the reference genome, RNA-seq validation results, RNA-seq junctions and source code are publicly hosted at: https://basespace.illumina.com/s/5Su6ThOblecrh
[00649] RNA-seq data for the 36 lymphoblastoid cell lines are being deposited in the ArrayExpress database at EMBL-EBI (www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-xxxx.
[00650] Prediction scores and source code are publicly released under a modified open-source Apache License v2.0 and are free for use in academic and non-commercial software applications. To reduce circularity problems that have become a concern for the field, the authors explicitly request that the method's prediction scores not be incorporated as a component of other classifiers, and instead ask that interested parties use the provided source code and data to directly train and improve their own deep learning models.
[0001] Table S1 shows the GTEx samples used to demonstrate effect size calculations and tissue-specific splicing effects. Related to FIGs. 38A, 38B, 38C, 38D, 38E, 38F, and 38G, FIG. 39A, FIG. 39B, and FIG. 45.
[0002] Table S2 shows the corresponding cutoffs for SpliceNet-10k, GeneSplicer, MaxEntScan and NNSplice at which all algorithms predict the same number of gains and losses across the genome. Related to FIG. 38G.
[0003] Table S3 shows the expected cryptic splice DNM counts in each cohort. Related to FIGs. 41A, 41B, 41C, 41D, 41E, and 41F and is reproduced below:

Cohort | Probands (n) | De novo synonymous per proband | Exons + introns up to 8 nt: unnormalized | adjusted for synonymous rate | Introns > 8 nt from exons: unnormalized | adjusted for synonymous rate
DDD | 4293 | 0.2844 | 347 | 298.7 | 14 | 12.1
ASD | 3953 | 0.2462 | 236 | 238.7 | 64 | 64.7
controls | 2073 | 0.2474 | 98 | 98 | 20 | 20
[0004] Table S4 shows the expected de novo mutation rates per gene for each mutational category. Related to FIGs. 41A, 41B, 41C, 41D, 41E and 41F.
[0005] Table S5 shows the P values for gene enrichment in DDD and ASD. Related to FIGs. 41A, 41B, 41C, 41D, 41E and 41F.
[0006] Table S6 shows the validation results for 36 predicted cryptic splice DNMs in patients with autism. Related to FIGs. 41A, 41B, 41C, 41D, 41E and 41F.
[0007] FIG. 59 is a simplified block diagram of a computer system that can be used to implement the disclosed technology. The computer system typically includes at least one processor that communicates with various peripheral devices via the bus subsystem. These peripheral devices may include a storage subsystem, including, for example, memory devices and a file storage subsystem, user interface input devices, user interface output devices and a network interface subsystem. The input and output devices allow a user to interact with the computer system. The network interface subsystem provides an interface to external networks, including an interface to corresponding interface devices in other computer systems.
[0008] In one implementation, the neural networks, such as the ACNN and CNN, are communicably linked to the storage subsystem and the user interface input devices.
[0009] User interface input devices may include a keyboard; pointing devices, such as a mouse, trackball, touchpad or graphics tablet; a scanner; a touch screen built into the display; audio input devices, such as speech recognition systems and microphones; and other types of input devices. In general, the use of the term "input device" is intended to include all possible types of devices and ways of entering information into the computer system.
[0010] User interface output devices may include a display subsystem, a printer, a fax machine or non-visual displays, such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
[0011] The storage subsystem stores programming and data constructs that provide the functionality of some or all of the modules and methods described in this document. These software modules are generally executed by the processor alone or in combination with other processors.
[0012] The memory used in the storage subsystem can include multiple memories, including a main random access memory (RAM) for storing instructions and data during program execution and a read-only memory (ROM) in which fixed instructions are stored. A file storage subsystem can provide persistent storage for program and data files and can include a hard drive, a floppy drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by the file storage subsystem in the storage subsystem, or on other machines accessible by the processor.
[0013] The bus subsystem provides a mechanism for allowing the various components and subsystems of the computer system to communicate with each other as intended. Although the bus subsystem is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple buses.
[0014] The computer system itself can be of varying types, including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, or any other data processing system or user device.
[0015] Deep learning processors can be GPUs or FPGAs and can be hosted by deep learning cloud platforms, such as Google Cloud Platform, Xilinx and Cirrascale. Examples of deep learning processors include Google's Tensor Processing Unit (TPU), rackmount solutions such as the GX4 Rackmount Series, the GX8 Rackmount Series, NVIDIA DGX-1, Microsoft's Stratix V FPGA, Graphcore's Intelligence Processing Unit (IPU), Qualcomm's Zeroth platform with Snapdragon processors, NVIDIA's Volta, NVIDIA's DRIVE PX, NVIDIA's JETSON TX1/TX2 MODULE, Intel's Nirvana, Movidius VPU, Fujitsu DPI, ARM's DynamicIQ, IBM TrueNorth and others.
[0016] The preceding description is presented to enable the making and use of the disclosed technology. Various modifications to the disclosed implementations will be apparent, and the general principles defined in this document may be applied to other implementations and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed technology is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed in this document. The scope of the disclosed technology is defined by the appended claims.
权利要求:
Claims (43)
[1]
1. Method implemented in neural network of training a splice site scorer that scores the probability of splice sites in pre-mRNA genomic sequences, the method characterized by the fact that it includes: training an atrous convolutional neural network, abbreviated ACNN, on several training examples, including at least 50,000 training examples of splice donor sites, at least 50,000 training examples of splice acceptor sites, and at least 100,000 training examples of non-splicing sites; inserting one-hot encoded example target nucleotide sequences into the ACNN for training, where a target nucleotide sequence includes a context of at least 200 nucleotides flanking it on each side, with at least 200 upstream context nucleotides and at least 200 downstream context nucleotides; and adjusting, by means of backpropagation, filter parameters in the ACNN to accurately predict, as an output, triple scores for the probability that the target nucleotide in the target nucleotide sequence is a splice donor site, a splice acceptor site or a non-splicing site; wherein the trained ACNN is configured to accept a nucleotide sequence of at least 401 nucleotides as input and to score at least one target nucleotide as a splice donor site, a splice acceptor site, or a non-splicing site.
[2]
2. Method implemented in neural network, according to claim 1, characterized by the fact that the input comprises a target nucleotide sequence having a target nucleotide flanked by 2500 nucleotides on each side.
[3]
3. Method implemented in neural network, according to claim 1, characterized by the fact that the target nucleotide sequence is additionally flanked by 5000 nucleotides in the upstream context and 5000 nucleotides in the downstream context.
[4]
4. Method implemented in neural network, according to claim 1, characterized by the fact that the input comprises a target nucleotide sequence having a target nucleotide flanked by 500 nucleotides on each side.
[5]
5. Method implemented in neural network, according to claim 1, characterized by the fact that the target nucleotide sequence is additionally flanked by 1000 upstream context nucleotides and 1000 downstream context nucleotides.
[6]
6. Method implemented in neural network, according to any one of claims 1 to 5, characterized by the fact that it additionally includes training the ACNN on at least 150,000 training examples of splice donor sites, 150,000 training examples of splice acceptor sites and 800,000,000 training examples of non-splicing sites.
[7]
7. Method implemented in neural network, according to any one of claims 1 to 6, characterized by the fact that the ACNN comprises groups of residual blocks arranged in a sequence from the lowest, closest to the input, to the highest.
[8]
8. Method implemented in neural network, according to claim 7, characterized by the fact that each group of residual blocks is parameterized by a number of convolution filters in the residual blocks, a convolution window size of the residual blocks and an atrous convolution rate of the residual blocks.
[9]
9. Method implemented in neural network, according to any one of claims 7 to 8, characterized by the fact that the atrous convolution rate progresses non-exponentially from a lower group of residual blocks to a higher group of residual blocks.
[10]
10. Method implemented in a neural network, according to any one of claims 7 to 9, characterized by the fact that the size of the convolution window varies between groups of residual blocks.
[11]
11. Method implemented in neural network, according to any one of claims 1 to 10, characterized by the fact that the ACNN additionally includes at least one group of four residual blocks and at least one connection between non-adjacent layers, in which each residual block has 32 convolution filters, convolution window size 11 and atrous convolution rate 1.
[12]
12. Method implemented in neural network, according to claim 1, characterized by the fact that the ACNN, when configured to evaluate an input comprising the target nucleotide sequence additionally flanked by 500 upstream context nucleotides and 500 downstream context nucleotides, additionally includes: at least two groups of four residual blocks and at least two connections between non-adjacent layers, where each residual block in a first group has 32 convolution filters, convolution window size 11 and atrous convolution rate 1, and each residual block in a second group has 32 convolution filters, convolution window size 11 and atrous convolution rate 4.
[13]
13. Method implemented in neural network, according to claim 1, characterized by the fact that the ACNN, when configured to evaluate an input comprising a target nucleotide sequence additionally flanked by 1000 upstream context nucleotides and 1000 downstream context nucleotides, additionally includes: at least three groups of four residual blocks and at least three connections between non-adjacent layers, where each residual block in a first group has 32 convolution filters, convolution window size 11 and atrous convolution rate 1, each residual block in a second group has 32 convolution filters, convolution window size 11 and atrous convolution rate 4, and each residual block in a third group has 32 convolution filters, convolution window size 21 and atrous convolution rate 19.
[14]
14. Method implemented in neural network, according to claim 1, characterized by the fact that the ACNN, when configured to evaluate an input comprising a target nucleotide sequence additionally flanked by 5000 upstream context nucleotides and 5000 downstream context nucleotides, additionally includes: at least four groups of four residual blocks and at least four connections between non-adjacent layers, where each residual block in a first group has 32 convolution filters, convolution window size 11 and atrous convolution rate 1, each residual block in a second group has 32 convolution filters, convolution window size 11 and atrous convolution rate 4, each residual block in a third group has 32 convolution filters, convolution window size 21 and atrous convolution rate 19, and each residual block in a fourth group has 32 convolution filters, convolution window size 41 and atrous convolution rate 25.
[15]
15. Method implemented in neural network, according to any one of claims 1 to 14, characterized by the fact that the triple scores for each nucleotide in the target nucleotide sequence are exponentially normalized and sum to unity.
[16]
16. Method implemented in neural network, according to any one of claims 1 to 15, characterized by the fact that it additionally includes classifying each nucleotide in the target nucleotide sequence as the splice donor site, the splice acceptor site or the non-splicing site based on the highest score among the respective triple scores.
[17]
17. Method implemented in neural network, according to any one of claims 1 to 16, characterized by the fact that the dimensionality of the input is (Cᵘ + L + Cᵈ) × 4, where: Cᵘ is a number of upstream context nucleotides, Cᵈ is a number of downstream context nucleotides, and L is a number of nucleotides in the target nucleotide sequence.
[18]
18. Method implemented in neural network, according to any one of claims 1 to 17, characterized by the fact that the dimensionality of the output is L × 3.
[19]
19. Method implemented in neural network, according to any one of claims 1 to 18, characterized by the fact that the dimensionality of the input is (5000 + 5000 + 5000) × 4.
[20]
20. Method implemented in neural network, according to any one of claims 1 to 19, characterized by the fact that the dimensionality of the output is 5000 × 3.
[21]
21. Method implemented in neural network, according to any one of claims 7 to 14, characterized by the fact that each group of residual blocks produces an intermediate output through the processing of a preceding input, in which the dimensionality of the intermediate output is (I - [{(W - 1) × D} × A]) × N, where: I is the dimensionality of the preceding input; W is the convolution window size of the residual blocks; D is the atrous convolution rate of the residual blocks; A is a number of atrous convolution layers in the group; and N is a number of convolution filters in the residual blocks.
[22]
22. Method implemented in neural network, according to any one of claims 1 to 21, characterized by the fact that the ACNN batch-evaluates the training examples during an epoch.
[23]
23. Method implemented in neural network, according to any one of claims 1 to 22, characterized by the fact that the training examples are randomly sampled in batches, in which each batch has a predetermined batch size.
[24]
24. Method implemented in neural network, according to any one of claims 1 to 23, characterized by the fact that the ACNN iterates the evaluation of the training examples over at least ten epochs.
[25]
25. Method implemented in neural network, according to any one of claims 1 to 24, characterized by the fact that the input comprises a target nucleotide sequence having two adjacent target nucleotides.
[26]
26. Method implemented in neural network, according to any one of claims 1 to 25, characterized by the fact that the two adjacent target nucleotides are adenine, abbreviated A, and guanine, abbreviated G.
[27]
27. Method implemented in neural network, according to any one of claims 1 to 26, characterized by the fact that the two adjacent target nucleotides are guanine, abbreviated G, and uracil, abbreviated U.
[28]
28. Method implemented in neural network, according to any one of claims 1 to 27, characterized by the fact that it additionally includes one-hot encoding the training examples and providing the one-hot encodings as input.
[29]
29. Method implemented in neural network, according to any one of claims 1 to 6 and 15 to 28, characterized by the fact that the ACNN is parameterized by a number of residual blocks, a number of connections between non-adjacent layers and a number of residual connections.
[30]
30. Method implemented in a neural network, according to any one of claims 1 to 29, characterized by the fact that atrous convolutions retain partial convolution calculations for reuse as the adjacent nucleotides are processed.
[31]
31. Method implemented in neural network, according to any one of claims 1 to 30, characterized by the fact that the ACNN comprises dimensionality-altering convolution layers that reshape the spatial and feature dimensions of a preceding input.
[32]
32. Method implemented in neural network, according to any one of claims 7 to 14 and 21, characterized by the fact that each residual block comprises at least one batch normalization layer, at least one rectified linear unit (abbreviated ReLU) layer, at least one atrous convolution layer and at least one residual connection.
[33]
33. Method implemented in a neural network, according to any one of claims 7 to 14 and 21, characterized by the fact that each residual block comprises two layers of batch normalization, two layers of ReLU non-linearity, two layers of atrous convolution and a residual connection.
[34]
34. A trained splice site scoring apparatus, characterized by the fact that it includes: several processors operating in parallel, coupled to memory; a trained atrous convolutional neural network, abbreviated ACNN, including a plurality of convolutional layers and filters with trained coefficients, running on the several processors, trained on at least 50,000 training examples of splice donor sites, at least 50,000 training examples of splice acceptor sites and at least 100,000 training examples of non-splicing sites, where the training examples used in the training include nucleotide sequences of a target nucleotide flanked by at least 400 nucleotides on each side;
an ACNN input stage that feeds an input sequence of at least 801 nucleotides, for evaluation of at least one target nucleotide flanked by at least 400 nucleotides on each side, to the convolutional layers; and an ACNN output stage following the convolutional layers that translates the ACNN's analysis into classification scores for the probability that each of the target nucleotides is a splice donor site, a splice acceptor site or a non-splicing site.
[35]
35. A trained splice site scoring apparatus according to claim 34, characterized by the fact that the ACNN is trained on 150,000 examples of splice donor sites, 150,000 examples of splice acceptor sites and 800,000,000 examples of non-splicing sites.
[36]
36. A trained splice site scoring apparatus according to any one of claims 34 to 35, characterized by the fact that the ACNN comprises groups of residual blocks arranged in a sequence from the lowest, closest to the input stage, to the highest.
[37]
37. Trained splice site scoring apparatus according to claim 36, characterized by the fact that each group of residual blocks is parameterized by a number of convolution filters in the residual blocks, a size of the convolution window of the residual blocks and an atrous convolution rate of the residual blocks.
[38]
38. A trained splice site scoring apparatus according to claim 37, characterized in that the atrous convolution rate progresses non-exponentially from a lower residual block group to a higher residual block group.
[39]
39. A trained splice site scoring apparatus according to claim 37, characterized by the fact that the size of the convolution window varies between groups of residual blocks.
[40]
40. A trained splice site scoring apparatus according to any of claims 34 to 39, characterized by the fact that ACNN is trained on one or more training servers.
[41]
41. A trained splice site scoring apparatus according to any one of claims 34 to 40, characterized by the fact that the trained ACNN is installed on one or more production servers that receive input sequences from requesting clients.
[42]
42. A trained splice site scoring apparatus according to any one of claims 34 to 41, characterized by the fact that the production servers process the input sequences through the ACNN input and output stages to produce outputs that are transmitted to the clients.
[43]
43. Method, characterized by the fact that it includes: feeding, to a trained atrous convolutional neural network, abbreviated ACNN, an input sequence of at least 801 nucleotides for evaluation, which includes a target nucleotide flanked by a context of at least 400 nucleotides on each side; wherein the trained ACNN has been trained on at least 50,000 training examples of splice donor sites, at least 50,000 training examples of splice acceptor sites and at least 100,000 training examples of non-splicing sites; wherein each of the training examples used in the training was a nucleotide sequence that includes a target nucleotide flanked by a context of at least 400 nucleotides on each side; and translating the ACNN's analysis into classification scores for the probability that each of the target nucleotides is a splice donor site, a splice acceptor site, or a non-splicing site.
44. System, characterized by the fact that it includes one or more processors coupled to memory, the memory loaded with computer instructions to train a splice site detector that identifies splice sites in genomic sequences, the instructions, when executed on the processors, implementing actions comprising:
training an atrous convolutional neural network, abbreviated ACNN, on several training examples, including at least 50,000 training examples of splice donor sites, at least 50,000 training examples of splice acceptor sites and at least 100,000 training examples of non-splicing sites;
inserting one-hot encoded example target nucleotide sequences into the ACNN for training, where a target nucleotide sequence includes a context of at least 200 nucleotides flanking it on each side, with at least 200 upstream context nucleotides and at least 200 downstream context nucleotides; and adjusting, by means of backpropagation, filter parameters in the ACNN to accurately predict, as an output, triple scores for the probability that the target nucleotide in the target nucleotide sequence is a splice donor site, a splice acceptor site or a non-splicing site.
US5641658A|1994-08-03|1997-06-24|Mosaic Technologies, Inc.|Method for performing amplification of nucleic acid with two primers bound to a single solid support|
AT269908T|1997-04-01|2004-07-15|Manteia S A|METHOD FOR SEQUENCING NUCLEIC ACIDS|
AR021833A1|1998-09-30|2002-08-07|Applied Research Systems|METHODS OF AMPLIFICATION AND SEQUENCING OF NUCLEIC ACID|
US20030064366A1|2000-07-07|2003-04-03|Susan Hardin|Real-time sequence determination|
EP1354064A2|2000-12-01|2003-10-22|Visigen Biotechnologies, Inc.|Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity|
AR031640A1|2000-12-08|2003-09-24|Applied Research Systems|ISOTHERMAL AMPLIFICATION OF NUCLEIC ACIDS IN A SOLID SUPPORT|
US7057026B2|2001-12-04|2006-06-06|Solexa Limited|Labelled nucleotides|
SI3363809T1|2002-08-23|2020-08-31|Illumina Cambridge Limited|Modified nucleotides for polynucleotide sequencing|
US20040002090A1|2002-03-05|2004-01-01|Pascal Mayer|Methods for detecting genome-wide sequence variations associated with a phenotype|
US7302146B2|2004-09-17|2007-11-27|Pacific Biosciences Of California, Inc.|Apparatus and method for analysis of molecules|
GB0427236D0|2004-12-13|2005-01-12|Solexa Ltd|Improved method of nucleotide detection|
EP2620510B2|2005-06-15|2020-02-19|Complete Genomics Inc.|Single molecule arrays for genetic and chemical analysis|
GB0514910D0|2005-07-20|2005-08-24|Solexa Ltd|Method for sequencing a polynucleotide template|
US7405281B2|2005-09-29|2008-07-29|Pacific Biosciences Of California, Inc.|Fluorescent nucleotide analogs and uses therefor|
GB0522310D0|2005-11-01|2005-12-07|Solexa Ltd|Methods of preparing libraries of template polynucleotides|
EP2021503A1|2006-03-17|2009-02-11|Solexa Ltd.|Isothermal methods for creating clonal single molecule arrays|
JP5122555B2|2006-03-31|2013-01-16|ソレクサ・インコーポレイテッド|Synthetic sequencing system and apparatus|
US7754429B2|2006-10-06|2010-07-13|Illumina Cambridge Limited|Method for pair-wise sequencing a plurality of target polynucleotides|
EP2089517A4|2006-10-23|2010-10-20|Pacific Biosciences California|Polymerase enzymes and reagents for enhanced nucleic acid sequencing|
WO2012095872A1|2011-01-13|2012-07-19|Decode Genetics Ehf|Genetic variants as markers for use in urinary bladder cancer risk assessment, diagnosis, prognosis and treatment|
WO2014142831A1|2013-03-13|2014-09-18|Illumina, Inc.|Methods and systems for aligning repetitive dna elements|
AU2015318017B2|2014-09-18|2022-02-03|Illumina, Inc.|Methods and systems for analyzing nucleic acid sequencing data|
CA3056303A1|2017-03-17|2018-09-20|Deep Genomics Incorporated|Systems and methods for determining effects of genetic variation on splice site selection|
US10628920B2|2018-03-12|2020-04-21|Ford Global Technologies, Llc|Generating a super-resolution depth-map|
AU2019379868A1|2018-11-15|2021-06-03|The Sydney Children's Hospitals Network|Methods of identifying genetic variants|
US11210554B2|2019-03-21|2021-12-28|Illumina, Inc.|Artificial intelligence-based generation of sequencing metadata|
NL2023314B1|2019-03-21|2020-09-28|Illumina Inc|Artificial intelligence-based quality scoring|
NL2023312B1|2019-03-21|2020-09-28|Illumina Inc|Artificial intelligence-based base calling|
NL2023311B9|2019-03-21|2021-03-12|Illumina Inc|Artificial intelligence-based generation of sequencing metadata|
WO2020205296A1|2019-03-21|2020-10-08|Illumina, Inc.|Artificial intelligence-based generation of sequencing metadata|
NL2023316B1|2019-03-21|2020-09-28|Illumina Inc|Artificial intelligence-based sequencing|
US11151412B2|2019-07-01|2021-10-19|Everseen Limited|Systems and methods for determining actions performed by objects within images|
CN110243828B|2019-07-18|2021-07-30|华中科技大学|Biological tissue three-dimensional imaging method based on convolutional neural network|
CN110473520A|2019-07-19|2019-11-19|上海麦图信息科技有限公司|A kind of air control Chinese and English voice method of discrimination based on deep learning|
WO2021055857A1|2019-09-20|2021-03-25|Illumina, Inc.|Artificial intelligence-based epigenetics|
CN110675391A|2019-09-27|2020-01-10|联想有限公司|Image processing method, apparatus, computing device, and medium|
CN111093123B|2019-12-09|2020-12-18|华中科技大学|Flexible optical network time domain equalization method and system based on composite neural network|
CN111026087B|2019-12-20|2021-02-09|中国船舶重工集团公司第七一九研究所|Weight-containing nonlinear industrial system fault detection method and device based on data|
US20210232857A1|2020-01-28|2021-07-29|Samsung Electronics Co., Ltd.|Electronic device and controlling method of electronic device|
CN111402951A|2020-03-17|2020-07-10|至本医疗科技(上海)有限公司|Copy number variation prediction method, device, computer device and storage medium|
CN111627145A|2020-05-19|2020-09-04|武汉卓目科技有限公司|Method and device for identifying fine hollow image-text of image|
CN111798921A|2020-06-22|2020-10-20|武汉大学|RNA binding protein prediction method and device based on multi-scale attention convolution neural network|
US11074412B1|2020-07-25|2021-07-27|Sas Institute Inc.|Machine learning classification system|
CN112183718A|2020-08-31|2021-01-05|华为技术有限公司|Deep learning training method and device for computing equipment|
US11132598B1|2021-02-23|2021-09-28|Neuraville, Llc|System and method for humanoid robot control and cognitive self-improvement without programming|
CN113362892B|2021-06-16|2021-12-17|北京阅微基因技术股份有限公司|Method for detecting and typing repetition number of short tandem repeat sequence|
Legal status:
2021-11-03| B350| Update of information on the portal [chapter 15.35 patent gazette]|
Priority:
Application No. | Filing Date | Patent Title
US201762573125P| true| 2017-10-16|2017-10-16|
US201762573135P| true| 2017-10-16|2017-10-16|
US201762573131P| true| 2017-10-16|2017-10-16|
US62/573,135|2017-10-16|
US62/573,125|2017-10-16|
US62/573,131|2017-10-16|
US201862726158P| true| 2018-08-31|2018-08-31|
US62/726,158|2018-08-31|
PCT/US2018/055915|WO2019079198A1|2017-10-16|2018-10-15|Deep learning-based splice site classification|